Add README.md configuration
README.md
CHANGED
|
@@ -1,3 +1,13 @@
|
|
| 1 |
# SummVis
|
| 2 |
|
| 3 |
SummVis is an open-source visualization tool that supports fine-grained analysis of summarization models, data, and evaluation
|
|
@@ -95,14 +105,7 @@ is omitted for copyright reasons). The `preprocessing.py` script can be used for
|
|
| 95 |
|
| 96 |
#### Deanonymize 10 examples:
|
| 97 |
```shell
|
| 98 |
-
python preprocessing.py \
|
| 99 |
-
--deanonymize \
|
| 100 |
-
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
|
| 101 |
-
--dataset cnn_dailymail \
|
| 102 |
-
--version 3.0.0 \
|
| 103 |
-
--split validation \
|
| 104 |
-
--processed_dataset_path data/10:cnn_dailymail_1000.validation \
|
| 105 |
-
--n_samples 10
|
| 106 |
```
|
| 107 |
This will take either a few seconds or a few minutes depending on whether you've previously loaded CNN/DailyMail from
|
| 108 |
the Datasets library.
|
|
@@ -149,48 +152,22 @@ Set the `--n_samples` argument and name the `--processed_dataset_path` output fi
|
|
| 149 |
|
| 150 |
#### Example: Deanonymize 100 examples from CNN / Daily Mail:
|
| 151 |
```shell
|
| 152 |
-
python preprocessing.py \
|
| 153 |
-
--deanonymize \
|
| 154 |
-
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
|
| 155 |
-
--dataset cnn_dailymail \
|
| 156 |
-
--version 3.0.0 \
|
| 157 |
-
--split validation \
|
| 158 |
-
--processed_dataset_path data/100:cnn_dailymail_1000.validation \
|
| 159 |
-
--n_samples 100
|
| 160 |
```
|
| 161 |
|
| 162 |
#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
|
| 163 |
```shell
|
| 164 |
-
python preprocessing.py \
|
| 165 |
-
--deanonymize \
|
| 166 |
-
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
|
| 167 |
-
--dataset cnn_dailymail \
|
| 168 |
-
--version 3.0.0 \
|
| 169 |
-
--split validation \
|
| 170 |
-
--processed_dataset_path data/full:cnn_dailymail_1000.validation \
|
| 171 |
-
--n_samples 1000
|
| 172 |
```
|
| 173 |
|
| 174 |
#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
|
| 175 |
```shell
|
| 176 |
-
python preprocessing.py \
|
| 177 |
-
--deanonymize \
|
| 178 |
-
--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
|
| 179 |
-
--dataset cnn_dailymail \
|
| 180 |
-
--version 3.0.0 \
|
| 181 |
-
--split validation \
|
| 182 |
-
--processed_dataset_path data/full:cnn_dailymail.validation
|
| 183 |
```
|
| 184 |
|
| 185 |
#### Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
|
| 186 |
```shell
|
| 187 |
-
python preprocessing.py \
|
| 188 |
-
--deanonymize \
|
| 189 |
-
--dataset_rg preprocessing/xsum_1000.validation.anonymized \
|
| 190 |
-
--dataset xsum \
|
| 191 |
-
--split validation \
|
| 192 |
-
--processed_dataset_path data/full:xsum_1000.validation \
|
| 193 |
-
--n_samples 1000
|
| 194 |
```
|
| 195 |
|
| 196 |
### 3. Run SummVis
|
|
@@ -244,10 +221,7 @@ You may run `preprocessing.py` to precompute all data required in the interface
|
|
| 244 |
|
| 245 |
1. Run preprocessing script to generate cache file
|
| 246 |
```shell
|
| 247 |
-
python preprocessing.py \
|
| 248 |
-
--workflow \
|
| 249 |
-
--dataset_jsonl path/to/my_dataset.jsonl \
|
| 250 |
-
--processed_dataset_path path/to/my_cache_file
|
| 251 |
```
|
| 252 |
You may wish to first try it with a subset of your data by adding the following argument: `--n_samples <number_of_samples>`.
|
| 253 |
|
|
@@ -278,20 +252,12 @@ standardized format with columns for `document` and `summary:reference`.
|
|
| 278 |
|
| 279 |
##### Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
|
| 280 |
```shell
|
| 281 |
-
python preprocessing.py \
|
| 282 |
-
--standardize \
|
| 283 |
-
--dataset cnn_dailymail \
|
| 284 |
-
--version 3.0.0 \
|
| 285 |
-
--split validation \
|
| 286 |
-
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
|
| 287 |
```
|
| 288 |
|
| 289 |
##### Example: Load custom `my_dataset.jsonl`, standardize, and save.
|
| 290 |
```shell
|
| 291 |
-
python preprocessing.py \
|
| 292 |
-
--standardize \
|
| 293 |
-
--dataset_jsonl path/to/my_dataset.jsonl \
|
| 294 |
-
--save_jsonl_path preprocessing/my_dataset.jsonl
|
| 295 |
```
|
| 296 |
|
| 297 |
Expected format of `my_dataset.jsonl`:
|
|
@@ -313,17 +279,7 @@ You may also generate your own predictions using [this script](generation.p
|
|
| 313 |
|
| 314 |
##### Example: Add 6 prediction files for PEGASUS and BART to the dataset.
|
| 315 |
```shell
|
| 316 |
-
python preprocessing.py \
|
| 317 |
-
--join_predictions \
|
| 318 |
-
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
|
| 319 |
-
--prediction_jsonls \
|
| 320 |
-
predictions/bart-cnndm.cnndm.validation.results.anonymized \
|
| 321 |
-
predictions/bart-xsum.cnndm.validation.results.anonymized \
|
| 322 |
-
predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
|
| 323 |
-
predictions/pegasus-multinews.cnndm.validation.results.anonymized \
|
| 324 |
-
predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
|
| 325 |
-
predictions/pegasus-xsum.cnndm.validation.results.anonymized \
|
| 326 |
-
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
|
| 327 |
```
|
| 328 |
|
| 329 |
#### 3. Run the preprocessing workflow and save the dataset.
|
|
@@ -333,19 +289,12 @@ and stores the processed dataset back to disk.
|
|
| 333 |
|
| 334 |
##### Example: Autorun with default settings on a few examples to try it.
|
| 335 |
```shell
|
| 336 |
-
python preprocessing.py \
|
| 337 |
-
--workflow \
|
| 338 |
-
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
|
| 339 |
-
--processed_dataset_path data/cnn_dailymail.validation \
|
| 340 |
-
--try_it
|
| 341 |
```
|
| 342 |
|
| 343 |
##### Example: Autorun with default settings on all examples.
|
| 344 |
```shell
|
| 345 |
-
python preprocessing.py \
|
| 346 |
-
--workflow \
|
| 347 |
-
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
|
| 348 |
-
--processed_dataset_path data/cnn_dailymail
|
| 349 |
```
|
| 350 |
|
| 351 |
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Summvis
|
| 3 |
+
emoji: 📚
|
| 4 |
+
colorFrom: yellow
|
| 5 |
+
colorTo: green
|
| 6 |
+
sdk: streamlit
|
| 7 |
+
app_file: app.py
|
| 8 |
+
pinned: false
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
# SummVis
|
| 12 |
|
| 13 |
SummVis is an open-source visualization tool that supports fine-grained analysis of summarization models, data, and evaluation
|
|
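The front matter added at the top of README.md is a flat list of `key: value` pairs between two `---` fences. As a minimal sketch, assuming only that flat layout (this is not a full YAML parser, and `parse_front_matter` is a hypothetical helper, not part of SummVis or of any Spaces tooling), the block can be read back and checked for the `sdk` and `app_file` entries this commit introduces:

```python
def parse_front_matter(text):
    """Return the key/value pairs found between the leading '---' fences.

    Hypothetical helper for illustration: handles flat `key: value` lines
    only, and returns {} if the opening or closing fence is missing.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    config = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return config
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return {}  # closing fence never found


# The front matter exactly as added in this commit.
README_HEAD = """---
title: Summvis
emoji: 📚
colorFrom: yellow
colorTo: green
sdk: streamlit
app_file: app.py
pinned: false
---

# SummVis
"""

config = parse_front_matter(README_HEAD)
# Checking these two keys is an illustrative choice, not a statement of
# which keys are mandatory.
missing = [key for key in ("sdk", "app_file") if key not in config]
print(config["sdk"], config["app_file"], missing)
```

A check like this could run before pushing, so that a misplaced fence or a dropped key is caught locally.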
|
|
| 105 |
|
| 106 |
#### Deanonymize 10 examples:
|
| 107 |
```shell
|
| 108 |
+
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/10:cnn_dailymail_1000.validation \
--n_samples 10
|
| 109 |
```
|
| 110 |
This will take either a few seconds or a few minutes depending on whether you've previously loaded CNN/DailyMail from
|
| 111 |
the Datasets library.
|
|
|
|
| 152 |
|
| 153 |
#### Example: Deanonymize 100 examples from CNN / Daily Mail:
|
| 154 |
```shell
|
| 155 |
+
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/100:cnn_dailymail_1000.validation \
--n_samples 100
|
| 156 |
```
|
| 157 |
|
| 158 |
#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
|
| 159 |
```shell
|
| 160 |
+
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail_1000.validation \
--n_samples 1000
|
| 161 |
```
|
| 162 |
|
| 163 |
#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
|
| 164 |
```shell
|
| 165 |
+
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail.validation
|
| 166 |
```
|
| 167 |
|
| 168 |
#### Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
|
| 169 |
```shell
|
| 170 |
+
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/xsum_1000.validation.anonymized \
--dataset xsum \
--split validation \
--processed_dataset_path data/full:xsum_1000.validation \
--n_samples 1000
|
| 171 |
```
|
| 172 |
|
| 173 |
### 3. Run SummVis
|
|
|
|
| 221 |
|
| 222 |
1. Run preprocessing script to generate cache file
|
| 223 |
```shell
|
| 224 |
+
python preprocessing.py \
 --workflow \
 --dataset_jsonl path/to/my_dataset.jsonl \
 --processed_dataset_path path/to/my_cache_file
|
| 225 |
```
|
| 226 |
You may wish to first try it with a subset of your data by adding the following argument: `--n_samples <number_of_samples>`.
|
| 227 |
|
|
|
|
| 252 |
|
| 253 |
##### Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
|
| 254 |
```shell
|
| 255 |
+
python preprocessing.py \
--standardize \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
|
| 256 |
```
|
| 257 |
|
| 258 |
##### Example: Load custom `my_dataset.jsonl`, standardize, and save.
|
| 259 |
```shell
|
| 260 |
+
python preprocessing.py \
--standardize \
--dataset_jsonl path/to/my_dataset.jsonl \
--save_jsonl_path preprocessing/my_dataset.jsonl
|
| 261 |
```
|
| 262 |
|
| 263 |
Expected format of `my_dataset.jsonl`:
|
|
|
|
| 279 |
|
| 280 |
##### Example: Add 6 prediction files for PEGASUS and BART to the dataset.
|
| 281 |
```shell
|
| 282 |
+
python preprocessing.py \
--join_predictions \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--prediction_jsonls \
predictions/bart-cnndm.cnndm.validation.results.anonymized \
predictions/bart-xsum.cnndm.validation.results.anonymized \
predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
predictions/pegasus-multinews.cnndm.validation.results.anonymized \
predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
predictions/pegasus-xsum.cnndm.validation.results.anonymized \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
|
| 283 |
```
|
| 284 |
|
| 285 |
#### 3. Run the preprocessing workflow and save the dataset.
|
|
|
|
| 289 |
|
| 290 |
##### Example: Autorun with default settings on a few examples to try it.
|
| 291 |
```shell
|
| 292 |
+
python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail.validation \
--try_it
|
| 293 |
```
|
| 294 |
|
| 295 |
##### Example: Autorun with default settings on all examples.
|
| 296 |
```shell
|
| 297 |
+
python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail
|
| 298 |
```
|
| 299 |
|
| 300 |
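The README context above describes the standardized format as having columns for `document` and `summary:reference`. As a rough sketch, assuming each line of the jsonl file is one JSON object keyed by column name (the helper `missing_columns` and the sample rows are hypothetical, not part of `preprocessing.py`), a file can be checked before it is passed to `--workflow`:

```python
import json

# Column names taken from the README text quoted in the diff context.
REQUIRED_COLUMNS = ("document", "summary:reference")


def missing_columns(jsonl_text):
    """Return (line_number, missing_column_names) for each incomplete row."""
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines without reporting them
        row = json.loads(line)
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            problems.append((line_no, missing))
    return problems


# Made-up rows for illustration; the second one lacks a reference summary.
sample = "\n".join([
    json.dumps({"document": "Full article text.",
                "summary:reference": "Short summary."}),
    json.dumps({"document": "Another article."}),
])

print(missing_columns(sample))
```

An empty list means every row carries both columns; anything else names the line and the columns it lacks.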
|