@omarkamali on Hugging Face: "Another month, another Wikipedia Monthly release! 🎃 Highlights of October's…"

omarkamali

posted an update 27 days ago

Another month, another Wikipedia Monthly release! 🎃

Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)

We now draw a random subset of each language using reservoir sampling, published as the splits 1000, 5000, and 10000, alongside the existing train split that contains all the data.
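For readers unfamiliar with reservoir sampling, here is a minimal sketch of one common variant (Algorithm R). The post does not say which variant or implementation the release pipeline uses, so treat this as an illustration only; the function name `reservoir_sample`, the `seed` parameter, and the fake article stream are all hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Hypothetical helper illustrating Algorithm R; not the code used
    to build the dataset splits.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an existing
            # one only if that slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a stream of articles that is too large to hold in memory.
articles = (f"article-{n}" for n in range(100_000))
sample = reservoir_sample(articles, k=1000)
```

The appeal of this approach is that it needs a single pass over the data and only keeps k items in memory, which suits dumps whose size is not known up front.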

Now you can load the English (or your favorite language) subset in seconds:
from datasets import load_dataset
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")

Happy data engineering! 🧰

omarkamali/wikipedia-monthly

MarcusLammers

27 days ago

Great update! 🔥 Quick question: is there any way to filter the Wikipedia subsets by topic (e.g. science, history, tech), or is it all random sampling per language?

omarkamali

27 days ago

Hey @MarcusLammers, this is a great idea! I will try to include it in a future iteration based on Wikipedia categories.