burtenshaw committed
Commit d12ff68 · 1 Parent(s): adb4caa

simplify readme
README.md CHANGED

@@ -61,35 +61,6 @@ After deduplication completes:

The dataset will be saved as `your-username/dataset-name` and be publicly available.

-## Technical Details
-
-- **Embedding Model**: Uses `minishlab/potion-base-8M` (Model2Vec) for fast, efficient text embeddings
-- **Deduplication Algorithm**: SemHash for scalable semantic similarity detection
-- **Backend**: Runs on CPU (may be slow for large datasets on free tier)
-
-## Local Usage
-
-For faster processing of large datasets, run locally:
-
-```bash
-git clone <repository-url>
-cd semantic-deduplication
-pip install -r requirements.txt
-python app.py
-```
-
-## Examples
-
-### Cross-dataset Deduplication
-Remove test set contamination:
-- **Dataset 1**: `your-org/training-data` (split: `train`)
-- **Dataset 2**: `your-org/test-data` (split: `test`)
-- **Result**: Clean test set with training examples removed
-
-### Single Dataset Cleaning
-Remove duplicates from a dataset:
-- **Dataset 1**: `common_voice` (split: `train`)
-- **Result**: Training set with duplicate audio transcriptions removed

## Notes

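The removed "Technical Details" section named `minishlab/potion-base-8M` (Model2Vec) as the embedding model and SemHash as the deduplication algorithm. For reference, here is a minimal sketch of single-dataset deduplication along those lines, assuming the published `semhash`, `model2vec`, and `datasets` Python APIs; the `ag_news` dataset, the `text` column, and the printed summary are illustrative stand-ins, not part of this Space's code.

```python
# Sketch: single-dataset semantic deduplication with SemHash + Model2Vec.
# Assumes the semhash, model2vec, and datasets packages as published.
from datasets import load_dataset
from model2vec import StaticModel
from semhash import SemHash

# Load the text column of the dataset to clean (ag_news is a stand-in).
texts = load_dataset("ag_news", split="train")["text"]

# potion-base-8M is the Model2Vec encoder named in the removed README section.
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Build the SemHash index and drop semantically near-duplicate records.
semhash = SemHash.from_records(records=texts, model=model)
result = semhash.self_deduplicate()

print(f"kept {len(result.deduplicated)} of {len(texts)} records")
```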
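The removed "Cross-dataset Deduplication" example (a training dataset versus a test dataset) corresponds to deduplicating one split against another. A hedged sketch of that workflow with SemHash follows, assuming its `deduplicate` method for comparing new records against an indexed set; the dataset names match the placeholders from the removed example and are not real repositories.

```python
# Sketch: remove training-set contamination from a test split with SemHash.
# Dataset names are placeholders; the SemHash API is assumed from its docs.
from datasets import load_dataset
from semhash import SemHash

train_texts = load_dataset("your-org/training-data", split="train")["text"]
test_texts = load_dataset("your-org/test-data", split="test")["text"]

# Index the training records, then keep only test records that are not
# semantically (near-)identical to something in the training set.
semhash = SemHash.from_records(records=train_texts)
clean_test = semhash.deduplicate(records=test_texts).deduplicated

print(f"removed {len(test_texts) - len(clean_test)} contaminated test records")
```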