minishlab/semantic-deduplication

burtenshaw committed on Jun 3

Commit

074bcd7

1 Parent(s): f6c9d95

add friendly readme

Files changed (1)

README.md +86 -1

@@ -9,6 +9,91 @@ app_file: app.py
 pinned: false
 license: mit
 short_description: Deduplicate HuggingFace datasets in seconds
+hf_oauth: true
+hf_oauth_scopes:
+  - write-repo
+  - manage-repo
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Semantic Text Deduplication Using SemHash
+This Gradio application performs **semantic deduplication** on HuggingFace datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.
+## Features
+- **Two deduplication modes**:
+  - **Single dataset**: Find and remove duplicates within one dataset
+  - **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1
+- **Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)
+- **Detailed results**: View statistics and examples of found duplicates with word-level differences highlighted
+- **Hub Integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in
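The word-level highlighting mentioned in the feature list can be illustrated with the standard library alone. This is a minimal sketch and not necessarily how the app renders differences; the function name `word_diff` and the `[-removed-]` / `{+added+}` markers are assumptions chosen for the example.

```python
import difflib


def word_diff(original: str, duplicate: str) -> str:
    """Merge two texts, marking removed words as [-word-] and added words as {+word+}."""
    a, b = original.split(), duplicate.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.extend(a[i1:i2])
        else:
            # "replace" has words on both sides, "delete" only in a, "insert" only in b.
            parts.extend(f"[-{w}-]" for w in a[i1:i2])
            parts.extend(f"{{+{w}+}}" for w in b[j1:j2])
    return " ".join(parts)


print(word_diff("turn on the kitchen lights", "turn off the kitchen lights"))
# turn [-on-] {+off+} the kitchen lights
```

Splitting on whitespace keeps the comparison at word granularity, which tends to be easier to scan than a character-level diff when reviewing near-duplicates.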
+## How to Use
+### 1. Choose Deduplication Type
+- **Cross-dataset**: Useful for removing training data contamination from test sets
+- **Single dataset**: Clean up duplicate entries within a single dataset
+### 2. Configure Datasets
+- Enter the HuggingFace dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
+- Specify the dataset splits (e.g., `train`, `test`, `validation`)
+- Set the text column name (usually `text`, `sentence`, or `content`)
+### 3. Set Similarity Threshold
+- **0.9** (default): Good balance between precision and recall
+- **Higher values** (0.95-0.99): More conservative, only removes very similar texts
+- **Lower values** (0.7-0.85): More aggressive, may remove semantically similar but different texts
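To build intuition for the threshold values above, here is a small dependency-free sketch. It is not SemHash's algorithm: it assumes a plain cosine similarity over made-up 2-D vectors and a greedy pass that keeps an item only if it stays below the threshold against everything already kept. The names `cosine` and `self_deduplicate` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def self_deduplicate(vectors, threshold):
    """Return indices of items kept by a greedy pass at the given threshold."""
    kept = []
    for idx, vec in enumerate(vectors):
        # Keep the item only if it is less similar than `threshold` to every kept item.
        if all(cosine(vec, vectors[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Toy "embeddings", invented for illustration only.
vectors = [(1.0, 0.0), (0.98, 0.2), (0.8, 0.6), (0.0, 1.0)]

for threshold in (0.99, 0.9, 0.7):
    print(threshold, self_deduplicate(vectors, threshold))
# 0.99 [0, 1, 2, 3]
# 0.9 [0, 2, 3]
# 0.7 [0, 3]
```

Lowering the threshold removes progressively more items, which matches the "more aggressive" behavior described for lower values.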
+### 4. Run Deduplication
+Click **"Deduplicate"** to start the process. You'll see:
+- Loading progress for datasets
+- Deduplication progress
+- Results with statistics and example duplicates
+### 5. Push to Hub (New!)
+After deduplication completes:
+1. **Log in** with your Hugging Face account using the login button
+2. Enter a **dataset name** for your cleaned dataset
+3. Click **"Push to Hub"** to upload the deduplicated dataset
+The dataset will be saved as `your-username/dataset-name` and be publicly available.
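Outside the app, a comparable upload could be sketched with the `datasets` library. This is an assumption about one way to do it, not the app's actual code: the helper names `build_repo_id` and `push_rows` are hypothetical, and the token is whatever credential you supply. `datasets` is imported lazily so the helper that builds the repository name can be used without it.

```python
def build_repo_id(username: str, dataset_name: str) -> str:
    """Build the `your-username/dataset-name` identifier used on the Hub."""
    return f"{username}/{dataset_name.strip()}"


def push_rows(rows, username, dataset_name, token):
    """Upload a list of row dicts as a dataset (hypothetical helper).

    Requires `pip install datasets`, network access, and a token with write access.
    """
    from datasets import Dataset  # lazy import: only needed when actually pushing

    repo_id = build_repo_id(username, dataset_name)
    # push_to_hub creates the repository if it does not exist yet.
    Dataset.from_list(rows).push_to_hub(repo_id, token=token)
    return repo_id


print(build_repo_id("your-username", "dataset-name"))
# your-username/dataset-name
```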
+## Technical Details
+- **Embedding Model**: Uses `minishlab/potion-base-8M` (Model2Vec) for fast, efficient text embeddings
+- **Deduplication Algorithm**: SemHash for scalable semantic similarity detection
+- **Backend**: Runs on CPU (may be slow for large datasets on the free tier)
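The components listed above could be combined roughly as follows. This is a hedged sketch based on the public SemHash and Model2Vec interfaces rather than on this Space's `app.py`. The function name `deduplicate_texts` is hypothetical, and the name of the attribute holding the kept records (`selected` versus `deduplicated`) is an assumption that depends on the installed SemHash version.

```python
def deduplicate_texts(texts, threshold=0.9):
    """Deduplicate a list of strings with SemHash and a Model2Vec encoder.

    Requires `pip install semhash model2vec`; the model is downloaded on first use.
    """
    from model2vec import StaticModel
    from semhash import SemHash

    # Static embedding model named in the README.
    model = StaticModel.from_pretrained("minishlab/potion-base-8M")

    # Index the texts, then remove near-duplicates within the same collection.
    semhash = SemHash.from_records(records=texts, model=model)
    result = semhash.self_deduplicate(threshold=threshold)

    # Assumption: newer releases expose `selected`, older ones `deduplicated`.
    kept = result.selected if hasattr(result, "selected") else result.deduplicated
    return kept, result.duplicate_ratio
```

The imports sit inside the function so that defining it does not require the libraries to be installed or the model to be downloaded.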
+## Local Usage
+For faster processing of large datasets, run locally:
+```bash
+git clone <repository-url>
+cd semantic-deduplication
+pip install -r requirements.txt
+python app.py
+```
+## Examples
+### Cross-dataset Deduplication
+Remove test set contamination:
+- **Dataset 1**: `your-org/training-data` (split: `train`)
+- **Dataset 2**: `your-org/test-data` (split: `test`)
+- **Result**: Clean test set with training examples removed
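The cross-dataset case can be sketched as a filter over the rows of Dataset 2. To stay runnable without dependencies, the example below uses `difflib`'s character-level ratio as a stand-in similarity; it is a lexical measure, not the embedding similarity the app uses. The function names and the toy rows are invented for illustration. Only the text column is compared, and surviving rows keep all their columns.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    # Stand-in only: character-level ratio, NOT a semantic embedding similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def remove_contaminated(train_rows, test_rows, text_column="text",
                        threshold=0.9, similarity=text_similarity):
    """Drop test rows whose text is too similar to any training text."""
    train_texts = [row[text_column] for row in train_rows]
    return [
        row for row in test_rows
        if all(similarity(row[text_column], t) < threshold for t in train_texts)
    ]


train = [{"text": "wake me up at nine am", "label": "alarm"}]
test = [
    {"text": "wake me up at nine am", "label": "alarm"},
    {"text": "play some jazz music", "label": "play"},
]

clean_test = remove_contaminated(train, test)
print(clean_test)
# [{'text': 'play some jazz music', 'label': 'play'}]
```

Passing the similarity function as a parameter makes it straightforward to swap in an embedding-based comparison later.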
+### Single Dataset Cleaning
+Remove duplicates from a dataset:
+- **Dataset 1**: `common_voice` (split: `train`)
+- **Result**: Training set with duplicate audio transcriptions removed
+## Notes
+- The app preserves all original columns from the datasets
+- Only the similarity of the selected text column is used for deduplication decisions
+- Deduplicated datasets maintain the same structure as the original
+- OAuth login is required only for pushing to the Hub, not for deduplication