omarkamali
posted an update 27 days ago
Another month, another Wikipedia Monthly release! 🎃

Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)

We now draw a random subset of each language using reservoir sampling, producing splits named 1000, 5000, and 10000 in addition to the existing train split, which contains all the data.
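For anyone unfamiliar with the technique, here is a minimal sketch of reservoir sampling (the classic "Algorithm R"). It is not the code used to build the dataset; the function name `reservoir_sample` and its parameters are made up for illustration. It picks k items uniformly at random from a stream in a single pass, without knowing the stream's length in advance:

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Hypothetical helper for illustration only (Algorithm R).
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an existing
            # one only if that slot falls inside the reservoir, which
            # happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: sample 1000 "articles" from a stream of 100,000 ids.
sample = reservoir_sample(range(100_000), k=1000, seed=0)
```

Memory use stays at k items no matter how large the stream is, which is what makes the approach practical for a dump with tens of millions of articles.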

Now you can load the English (or your favorite language) subset in seconds:
from datasets import load_dataset
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")
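If you switch languages often, a small wrapper can help. This is only a sketch: it assumes other languages follow the same "latest.<language code>" config pattern as "latest.en" above, and the helper names `config_name` and `load_sample` are invented for illustration, not part of the dataset or of the `datasets` library.

```python
# Split names taken from the post: three sampled sizes plus the full data.
VALID_SPLITS = {"train", "1000", "5000", "10000"}


def config_name(lang: str, snapshot: str = "latest") -> str:
    """Build a config name such as 'latest.en' (pattern assumed from the post)."""
    return f"{snapshot}.{lang}"


def load_sample(lang: str, split: str = "10000"):
    """Load one split of one language; hypothetical convenience wrapper."""
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split {split!r}, expected one of {sorted(VALID_SPLITS)}")
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("omarkamali/wikipedia-monthly", config_name(lang), split=split)
```

For example, `load_sample("en", "1000")` would request the smallest sampled split, while `load_sample("en", "train")` would request the full data.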

Happy data engineering! 🧰

omarkamali/wikipedia-monthly

Great update! 🔥 Quick question: is there any way to filter the Wikipedia subsets by topic (e.g. science, history, tech), or is it all random sampling per language?


Hey @MarcusLammers, this is a great idea! I will try to include it in a future iteration based on Wikipedia categories.