radoslavralev committed on
Commit 94ef3eb · verified · 1 Parent(s): 7fe55cc

Add new SentenceTransformer model

Files changed (2)
  1. README.md +70 -71
  2. config.json +1 -1
README.md CHANGED
@@ -12,51 +12,50 @@ tags:
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
- - dataset_size:9233417
16
  - loss:ArcFaceInBatchLoss
17
  base_model: answerdotai/ModernBERT-large
18
  widget:
19
- - source_sentence: Hayley Vaughan portrayed Ripa on the ABC daytime soap opera , ``
20
- All My Children `` , between 1990 and 2002 .
21
  sentences:
22
- - Traxxpad is a music application for Sony 's PlayStation Portable published by
23
- Definitive Studios and developed by Eidos Interactive .
24
- - Between 1990 and 2002 , Hayley Vaughan Ripa portrayed in the ABC soap opera ``
25
- All My Children `` .
26
- - Between 1990 and 2002 , Ripa Hayley portrayed Vaughan in the ABC soap opera ``
27
- All My Children `` .
28
- - source_sentence: Olivella monilifera is a species of dwarf sea snail , small gastropod
29
- mollusk in the family Olivellidae , the marine olives .
30
  sentences:
31
- - Olivella monilifera is a species of the dwarf - sea snail , small gastropod mollusk
32
- in the Olivellidae family , the marine olives .
33
- - He was cut by the Browns after being signed by the Bills in 2013 . He was later
34
- released .
35
- - Olivella monilifera is a kind of sea snail , marine gastropod mollusk in the Olivellidae
36
- family , the dwarf olives .
37
- - source_sentence: Hayashi said that Mackey `` is a sort of `` of the original model
38
- for Tenchi .
39
  sentences:
40
- - In the summer of 2009 , Ellick shot a documentary about Malala Yousafzai .
41
- - Hayashi said that Mackey is `` sort of `` the original model for Tenchi .
42
- - Mackey said that Hayashi is `` sort of `` the original model for Tenchi .
43
- - source_sentence: Much of the film was shot on location in Los Angeles and in nearby
44
- Burbank and Glendale .
 
 
 
45
  sentences:
46
- - Much of the film was shot on location in Los Angeles and in nearby Burbank and
47
- Glendale .
48
- - Much of the film was shot on site in Burbank and Glendale and in the nearby Los
49
- Angeles .
50
- - Traxxpad is a music application for the Sony PlayStation Portable developed by
51
- the Definitive Studios and published by Eidos Interactive .
52
- - source_sentence: According to him , the earth is the carrier of his artistic work
53
- , which is only integrated into the creative process by minimal changes .
54
  sentences:
55
- - National players are Bold players .
56
- - According to him , earth is the carrier of his artistic work being integrated
57
- into the creative process only by minimal changes .
58
- - According to him , earth is the carrier of his creative work being integrated
59
- into the artistic process only by minimal changes .
 
60
  datasets:
61
  - redis/langcache-sentencepairs-v2
62
  pipeline_tag: sentence-similarity
@@ -81,28 +80,28 @@ model-index:
81
  type: test
82
  metrics:
83
  - type: cosine_accuracy@1
84
- value: 0.44081091729646477
85
  name: Cosine Accuracy@1
86
  - type: cosine_precision@1
87
- value: 0.44081091729646477
88
  name: Cosine Precision@1
89
  - type: cosine_recall@1
90
- value: 0.42663486382682986
91
  name: Cosine Recall@1
92
  - type: cosine_ndcg@10
93
- value: 0.6274011007244752
94
  name: Cosine Ndcg@10
95
  - type: cosine_mrr@1
96
- value: 0.44081091729646477
97
  name: Cosine Mrr@1
98
  - type: cosine_map@100
99
- value: 0.5749605963252064
100
  name: Cosine Map@100
101
  - type: cosine_auc_precision_cache_hit_ratio
102
- value: 0.27130175854619276
103
  name: Cosine Auc Precision Cache Hit Ratio
104
  - type: cosine_auc_similarity_distribution
105
- value: 0.40770905754259995
106
  name: Cosine Auc Similarity Distribution
107
  ---
108
 
@@ -156,9 +155,9 @@ from sentence_transformers import SentenceTransformer
156
  model = SentenceTransformer("redis/langcache-embed-experimental")
157
  # Run inference
158
  sentences = [
159
- 'According to him , the earth is the carrier of his artistic work , which is only integrated into the creative process by minimal changes .',
160
- 'According to him , earth is the carrier of his artistic work being integrated into the creative process only by minimal changes .',
161
- 'According to him , earth is the carrier of his creative work being integrated into the artistic process only by minimal changes .',
162
  ]
163
  embeddings = model.encode(sentences)
164
  print(embeddings.shape)
@@ -167,9 +166,9 @@ print(embeddings.shape)
167
  # Get the similarity scores for the embeddings
168
  similarities = model.similarity(embeddings, embeddings)
169
  print(similarities)
170
- # tensor([[1.0000, 0.9844, 0.9844],
171
- # [0.9844, 0.9961, 0.9922],
172
- # [0.9844, 0.9922, 0.9961]], dtype=torch.bfloat16)
173
  ```
174
 
175
  <!--
@@ -207,14 +206,14 @@ You can finetune this model on your own dataset.
207
 
208
  | Metric | Value |
209
  |:-------------------------------------|:-----------|
210
- | cosine_accuracy@1 | 0.4408 |
211
- | cosine_precision@1 | 0.4408 |
212
- | cosine_recall@1 | 0.4266 |
213
  | **cosine_ndcg@10** | **0.6274** |
214
- | cosine_mrr@1 | 0.4408 |
215
  | cosine_map@100 | 0.575 |
216
- | cosine_auc_precision_cache_hit_ratio | 0.2713 |
217
- | cosine_auc_similarity_distribution | 0.4077 |
218
 
219
  <!--
220
  ## Bias, Risks and Limitations
@@ -235,19 +234,19 @@ You can finetune this model on your own dataset.
235
  #### LangCache Sentence Pairs (all)
236
 
237
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
238
- * Size: 126,938 training samples
239
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
240
  * Approximate statistics based on the first 1000 samples:
241
  | | anchor | positive | negative |
242
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
243
  | type | string | string | string |
244
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
245
  * Samples:
246
- | anchor | positive | negative |
247
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
248
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
249
- | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
250
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
251
  * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
252
  ```json
253
  {
@@ -262,19 +261,19 @@ You can finetune this model on your own dataset.
262
  #### LangCache Sentence Pairs (all)
263
 
264
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
265
- * Size: 126,938 evaluation samples
266
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
267
  * Approximate statistics based on the first 1000 samples:
268
  | | anchor | positive | negative |
269
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
270
  | type | string | string | string |
271
- | details | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.27 tokens</li><li>max: 48 tokens</li></ul> | <ul><li>min: 7 tokens</li><li>mean: 26.54 tokens</li><li>max: 61 tokens</li></ul> |
272
  * Samples:
273
- | anchor | positive | negative |
274
- |:--------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------|
275
- | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>how can I get financial freedom as soon as possible?</code> |
276
- | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The older Punts are still very much in existence today and race in the same fleets as the newer boats .</code> |
277
- | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , , was located at Turner Valley Bar N Ranch Airport , southwest of Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
278
  * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
279
  ```json
280
  {
 
12
  - retrieval
13
  - reranking
14
  - generated_from_trainer
15
+ - dataset_size:3587
16
  - loss:ArcFaceInBatchLoss
17
  base_model: answerdotai/ModernBERT-large
18
  widget:
19
+ - source_sentence: Hunter College was originally Lehman College 's uptown campus .
 
20
  sentences:
21
+ - Acquired programming includes the Irish soap `` Fair City `` and Finnish drama
22
+ `` Black Widows `` .
23
+ - According to the United States Census Bureau , the town has a total area of ;
24
+ of the area is land and 0.66 % is water .
25
+ - Hunter College originally was Lehman College Uptown Campus .
26
+ - source_sentence: He hoped to defeat them and then marry Ravonna .
 
 
27
  sentences:
28
+ - Stillwater Creek received its official name in 1884 when William L. Couch established
29
+ his `` boomer colony `` on its banks .
30
+ - Note that the invertible of a matrix is always an exponential matrix .
31
+ - He hoped to defeat them and marry Ravonna .
32
+ - source_sentence: Born on February 2 , 1984 , Abrar Khan is a professional Pakistani
33
+ international Kabaddi player .
 
 
34
  sentences:
35
+ - Born on February 2 , 1984 , Abrar Khan is a professional Pakistani international
36
+ Kabaddi player .
37
+ - Together , the paired mylohyoid muscles form a muscular floor for the oral cavity
38
+ of the mouth .
39
+ - Abrar Khan born 2 February 1984 is a Pakistani professional international Kabaddi
40
+ player .
41
+ - source_sentence: Certainly , `` Lucy was nothing like flat `` in physical form ,
42
+ social condition , and personality .
43
  sentences:
44
+ - The real number is called the `` imaginary part `` of the real number ; the real
45
+ number is called the `` complex part `` of .
46
+ - From the Celebes lake , the captain Bullock observed the appearance of the corona
47
+ , while Gustav Fritsch accompanied an expedition to Aden .
48
+ - Certainly `` Lucy was , in physical form , social condition and personality ,
49
+ nothing like Shallow `` .
50
+ - source_sentence: The trio has performed besides Gesaffelstein , Justice , Bob Moses
51
+ and Lee Foss .
52
  sentences:
53
+ - The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss
54
+ .
55
+ - The suttas generally contain educational content , while other early Buddhist
56
+ texts deal with monastic discipline or vinaya .
57
+ - The trio has performed alongside Bob Moses , Justice , Gesaffelstein and Lee Foss
58
+ .
59
  datasets:
60
  - redis/langcache-sentencepairs-v2
61
  pipeline_tag: sentence-similarity
 
80
  type: test
81
  metrics:
82
  - type: cosine_accuracy@1
83
+ value: 0.44070346359110285
84
  name: Cosine Accuracy@1
85
  - type: cosine_precision@1
86
+ value: 0.44070346359110285
87
  name: Cosine Precision@1
88
  - type: cosine_recall@1
89
+ value: 0.42648577181064024
90
  name: Cosine Recall@1
91
  - type: cosine_ndcg@10
92
+ value: 0.627438499402098
93
  name: Cosine Ndcg@10
94
  - type: cosine_mrr@1
95
+ value: 0.44070346359110285
96
  name: Cosine Mrr@1
97
  - type: cosine_map@100
98
+ value: 0.5750186225138979
99
  name: Cosine Map@100
100
  - type: cosine_auc_precision_cache_hit_ratio
101
+ value: 0.27246772094744054
102
  name: Cosine Auc Precision Cache Hit Ratio
103
  - type: cosine_auc_similarity_distribution
104
+ value: 0.40850809564840007
105
  name: Cosine Auc Similarity Distribution
106
  ---
107
 
 
155
  model = SentenceTransformer("redis/langcache-embed-experimental")
156
  # Run inference
157
  sentences = [
158
+ 'The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss .',
159
+ 'The trio has performed besides Gesaffelstein , Justice , Bob Moses and Lee Foss .',
160
+ 'The trio has performed alongside Bob Moses , Justice , Gesaffelstein and Lee Foss .',
161
  ]
162
  embeddings = model.encode(sentences)
163
  print(embeddings.shape)
 
166
  # Get the similarity scores for the embeddings
167
  similarities = model.similarity(embeddings, embeddings)
168
  print(similarities)
169
+ # tensor([[0.9961, 0.9961, 0.9883],
170
+ # [0.9961, 0.9961, 0.9883],
171
+ # [0.9883, 0.9883, 1.0000]], dtype=torch.bfloat16)
172
  ```
173
 
174
  <!--
 
206
 
207
  | Metric | Value |
208
  |:-------------------------------------|:-----------|
209
+ | cosine_accuracy@1 | 0.4407 |
210
+ | cosine_precision@1 | 0.4407 |
211
+ | cosine_recall@1 | 0.4265 |
212
  | **cosine_ndcg@10** | **0.6274** |
213
+ | cosine_mrr@1 | 0.4407 |
214
  | cosine_map@100 | 0.575 |
215
+ | cosine_auc_precision_cache_hit_ratio | 0.2725 |
216
+ | cosine_auc_similarity_distribution | 0.4085 |
217
 
218
  <!--
219
  ## Bias, Risks and Limitations
 
234
  #### LangCache Sentence Pairs (all)
235
 
236
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
237
+ * Size: 1,922 training samples
238
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
239
  * Approximate statistics based on the first 1000 samples:
240
  | | anchor | positive | negative |
241
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
242
  | type | string | string | string |
243
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.26 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.24 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.09 tokens</li><li>max: 49 tokens</li></ul> |
244
  * Samples:
245
+ | anchor | positive | negative |
246
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------|
247
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>At that time , on June 22 , 1754 , Edward Bentham married Bentham Elizabeth Bates ( d . 1790 ) from Hampshire in the nearby county of Alton .</code> |
248
+ | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>In 2012 , Cornell 5th and Lehigh 8th , Cornell was also 4th in 2013 and 7th in 2014 .</code> |
249
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
250
  * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
251
  ```json
252
  {
 
261
  #### LangCache Sentence Pairs (all)
262
 
263
  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
264
+ * Size: 1,922 evaluation samples
265
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
266
  * Approximate statistics based on the first 1000 samples:
267
  | | anchor | positive | negative |
268
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
269
  | type | string | string | string |
270
+ | details | <ul><li>min: 8 tokens</li><li>mean: 27.26 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 8 tokens</li><li>mean: 27.24 tokens</li><li>max: 49 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.09 tokens</li><li>max: 49 tokens</li></ul> |
271
  * Samples:
272
+ | anchor | positive | negative |
273
+ |:--------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------|
274
+ | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>At that time , on June 22 , 1754 , Edward Bentham married Bentham Elizabeth Bates ( d . 1790 ) from Hampshire in the nearby county of Alton .</code> |
275
+ | <code>The newer punts are still very much in existence today and run in the same fleets as the older boats .</code> | <code>The newer Punts are still very much in existence today and race in the same fleets as the older boats .</code> | <code>In 2012 , Cornell 5th and Lehigh 8th , Cornell was also 4th in 2013 and 7th in 2014 .</code> |
276
+ | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley , was at the Turner Valley Bar N Ranch Airport , southwest of the Turner Valley Bar N Ranch , Alberta , Canada .</code> | <code>Turner Valley Bar N Ranch Airport , , was located at Turner Valley Bar N Ranch , southwest of Turner Valley , Alberta , Canada .</code> |
277
  * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
278
  ```json
279
  {
config.json CHANGED
@@ -8,7 +8,7 @@
8
  "classifier_activation": "gelu",
9
  "classifier_bias": false,
10
  "classifier_dropout": 0.0,
11
- "classifier_pooling": "mean",
12
  "cls_token_id": 50281,
13
  "decoder_bias": true,
14
  "deterministic_flash_attn": false,
 
8
  "classifier_activation": "gelu",
9
  "classifier_bias": false,
10
  "classifier_dropout": 0.0,
11
+ "classifier_pooling": "cls",
12
  "cls_token_id": 50281,
13
  "decoder_bias": true,
14
  "deterministic_flash_attn": false,