OER Dataset for AI project

At OEGlobal24 I presented an idea for a project that I think we could all contribute to.

Here is a quick video I made, in case you missed the presentation at #OEGlobal24.

The goal here is to discuss whether this is something OE Global as an organisation should take on.

Some of the issues are:

  • Other aligned projects we can support/join?
  • Strategic partnerships (e.g. Pressbooks, Wikimedia)?
  • Is existing funding enough, or do we need new funding?
  • Negotiate licensing for commercial platforms?
  • Which tools to use (e.g. MoodleNet/Kolibri/MERLOT)?
  • How to empower our community to contribute?
  • What are more ways to increase trust?

What's your reaction? Let's talk!


Hi Martin,

Thanks for the fruitful thoughts you shared yesterday. :smile:
I envision that the OER collections must be multilingual, because that will affect how the AI is trained and how it performs. It might need an iterative collecting process: we could primarily obtain English-language OERs in the first round as a base, and then to some extent build dual- or multi-language repositories with alignment (maybe much like we have in Wikipedia right now).

Ken


You're right, many of the big gaps are in non-English languages.

It would be great to focus on the big 37 languages with more than 50 million speakers (List of languages by total number of speakers - Wikipedia), but these are also the most likely to be tackled by other projects in some way.

The thousands of other languages with very little representation in OER may have more urgency …?

My gut feeling is that translation (or even aligned language datasets) is less important than good-quality primary data (and remember we are talking not just about text, but images, audio, video, and more) … translation is perhaps better tackled later, once there's a corpus of good labelled OERs to work with.

Excellent points about quality primary content! Let me build on that with some practical challenges and opportunities I see:
First, you're absolutely right about the translation dilemma. While translation seems like a quick fix, it often misses cultural nuances and contextual relevance. I've seen this firsthand, where translated OERs just didn't resonate with local learners.

But I want to push back slightly on deprioritizing the major languages. Perhaps we need a "both/and" rather than "either/or" approach? Here's why:

  1. The major languages could serve as proving grounds for quality control systems and community engagement models. Success there could inform work with smaller language communities.
  2. Many speakers of smaller languages are actually bilingual with a major language. Building strong OER ecosystems in major languages could provide "interim" access while we develop resources in smaller languages.

That said, I strongly agree with Martin's emphasis on multimodal content. We should be thinking beyond text from the start. In many communities, oral traditions and visual learning are paramount. This includes audio recordings of local experts, culturally relevant visuals and examples, and interactive elements that reflect local teaching/learning styles.

On the quality assurance front, assuring the quality of primary data is tougher than ever. However, I'm cautiously optimistic that community-driven approaches can bring us trustworthy repositories. OER communities will do a better job than tech giants such as Facebook (LOL), but we need robust frameworks.

Has anyone experimented with peer review systems that combine local language experts, SMEs, community leaders, educators, etc.?

What do others think about balancing these different priorities? And how might we structure pilot projects to test these ideas?



I just added a little video to the original post above that explains more of the background.


Thanks for adding the video, Martin; it's a good representation of the talk you gave.

On the surface this seems rational. We are faced with the opaqueness of both the source content and the inner workings of the readily accessible generative AI tools, so why not gather up all the great OER content to create an LLM where this is not the case?

My first question is how much content is needed? It seems like an extraordinarily large amount; the commercially available models have been trained on corpora including Wikipedia, and that is just a part of the pie.

And are you considering entire works like open courses, open textbooks, or more granular content?

I also think your diagram suggests we just pour all our existing OER, in all their formats, into a giant vat. From the little I understand, the training data needs to be in a structured and uniform format. It would seem like a great deal of processing would be needed, especially to organize it into that structure.
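To illustrate what "structured and uniform" might mean in practice, here is a rough sketch (the field names and the example record are purely hypothetical, not a proposed standard) of one possible training record that mixed-format OER would have to be boiled down to:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class OERRecord:
    """One possible uniform training record; the fields are placeholders,
    not a proposed standard."""
    source: str      # where the item came from
    title: str
    language: str    # e.g. "en", "sw"
    license: str     # e.g. "CC-BY-4.0"
    modality: str    # "text", "image", "audio", "video"
    text: str        # extracted text, caption, or transcript

# A PDF chapter, an audio lecture, and an illustration would all end up as
# records shaped like this; the extraction step for each (PDF parsing,
# speech-to-text, captioning) is where most of the processing effort lives.
example = OERRecord(
    source="https://example.org/open-textbook/chapter-3.pdf",  # hypothetical URL
    title="Photosynthesis",
    language="en",
    license="CC-BY-4.0",
    modality="text",
    text="Photosynthesis is the process by which ...",
)
print(json.dumps(asdict(example), ensure_ascii=False))
```

The hard part, of course, is the extraction that gets each original format into that shape.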

It seems to me a more efficient approach might be to work with the organizations already collating OER: OER Commons, MERLOT, Pressbooks, OpenStax, LibreTexts, et al.

This certainly seems to be the work that Open Future is already engaged in (AI and the Commons – Open Future), where I am learning about:

  • PD12M: "a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time." On its own, I am happy to discover Source.Plus as a search tool for open-licensed images that provides full attribution.
  • Structured Media Wikimedia Dataset (Hugging Face): "This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.)." Uncompressed this dataset is almost 80 GB. (A small reading sketch follows this list.)
  • Common Corpus (Hugging Face): "the largest open and permissible licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more."
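To give a sense of how approachable a consistent JSONL schema is (as described for the Wikimedia dataset above), a few lines of Python are enough to stream and inspect one; the file name and the "title"/"text" keys below are assumptions for illustration, not that dataset's actual schema:

```python
import json

# Stream a JSONL dataset (one JSON object per line) without loading the
# whole file into memory. "articles.jsonl" and the "title"/"text" keys
# are assumptions for this sketch; check the dataset card for the real
# file names and schema.
def sample_records(path, limit=5):
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            yield json.loads(line)

for rec in sample_records("articles.jsonl"):
    print(rec.get("title", "<no title>"), "-", len(rec.get("text", "")), "chars")
```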

I might guess your response would be that these are not purely open educational materials: where is the vast pile of truly worthy learning materials? How do we efficiently go about extracting content from open courseware, open educational resources, etc.? Getting it into machine-readable format is a formidable task.

Cheers, thanks Alan.

In the diagram, the two databases are meant to separate the collection part (all in original formats) from the re-formatted dataset (in a form suitable for training). I think we can, without too much trouble, come up with a mini prototype of the system that demonstrates how things would look at each stage.
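Here is a minimal sketch of what such a prototype could look like, assuming a `collection/` folder that keeps items in their original formats with a metadata sidecar, and a `dataset/` folder holding the re-formatted training records (all names, and the plain-text-only handling, are assumptions for illustration):

```python
import json
import shutil
from pathlib import Path

COLLECTION = Path("collection")   # stage 1: originals kept untouched, plus metadata
DATASET = Path("dataset")         # stage 2: uniform, training-ready records

def collect(src: Path, metadata: dict) -> Path:
    """Stage 1: copy the original file as-is and store its metadata beside it."""
    COLLECTION.mkdir(exist_ok=True)
    dest = COLLECTION / src.name
    shutil.copy2(src, dest)
    dest.with_name(dest.name + ".meta.json").write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return dest

def reformat(collected: Path) -> None:
    """Stage 2: turn one collected item into a JSON line in the training dataset."""
    DATASET.mkdir(exist_ok=True)
    meta = json.loads(
        collected.with_name(collected.name + ".meta.json").read_text(encoding="utf-8")
    )
    record = {
        "source": collected.name,
        "license": meta.get("license", "UNKNOWN"),
        "language": meta.get("language", "und"),
        # Real extraction would depend on the file type; this sketch only
        # handles plain text files.
        "text": collected.read_text(encoding="utf-8", errors="ignore"),
    }
    with (DATASET / "records.jsonl").open("a", encoding="utf-8") as out:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    # Hypothetical input file, just to show the two stages end to end.
    item = collect(Path("intro_to_biology.txt"), {"license": "CC-BY-4.0", "language": "en"})
    reformat(item)
```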

And yes, there are other dataset projects, and maybe we can use them, but I think we'd need to evaluate them carefully from the viewpoint of a) whether they are suitable educational materials, and b) what bias they have in terms of culture/language, etc.

A lot of this project would be about finding/creating things to help fill in the gaps.

This would be very positive; how do we get started? I wonder if you have thought of all the formats we might have to deal with, from files (PDF, Word, audio, video) to combined larger content (courses, web sites, open textbooks, interactives).
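One small way to make that variety of formats concrete is a router that maps each file type to the kind of handler it would need; the handler names here are stubs and the mapping is illustrative only, and combined content (courses, interactives, whole sites) would need unpacking first:

```python
from pathlib import Path

# Map file extensions to the kind of processing each would need. The handler
# names are stubs; real implementations would wrap PDF parsers, OCR,
# speech-to-text, etc.
HANDLERS = {
    ".pdf": "extract_pdf_text",
    ".docx": "extract_docx_text",
    ".epub": "extract_epub_text",
    ".html": "extract_html_text",
    ".mp3": "transcribe_audio",
    ".mp4": "transcribe_video",
}

def route(path: Path) -> str:
    """Return the name of the handler a given file would be sent to."""
    return HANDLERS.get(path.suffix.lower(), "needs_manual_review")

for name in ["open-textbook.pdf", "lecture.mp3", "interactive.h5p"]:
    print(name, "->", route(Path(name)))
```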

It might help me understand too, Martin, to see some examples of the types of OER you think ought to be in said dataset.

I was not thinking as much about using them, though that is possible, but just looking for comparison at the size of these collections, and the references within to how they have been analyzed/reviewed/curated.

I also would like to know what you think is a meaningful volume of material. I still remain somewhat skeptical: if we just rely on single contributions, it will take a long time to accumulate; we need to rope in full collections of existing resources.

And finally, you pitched this as something "OEGlobal" should do, but honestly, the small organization here does not have the bandwidth or expertise to do it, so I would like to think you are saying "OEGlobal" collectively as an organization/community.

So where do we start? Do we ask for interested parties to be in a working group?

Me again :wink: Is something like Dagshub worth looking into? It seems to be designed for managing data models and such.

I came across it from looking at Source.plus, which includes the PD12M dataset of over 12 million public domain images, and seems to be built not only for finding PD content but for building and inspecting datasets. They also seem to have some kind of community governance or input model for reporting on data issues.

Yes, I was proposing it as something the OEGlobal leadership (Board + Executive Directors) might consider making part of the strategy and budget, and if that happens, then the whole membership could be involved.

I'm not proposing to lead it myself; I have too many other things I'm already not spending enough time on. :slight_smile:

Here's another similar initiative (but again, not focussed on OER specifically):

Dagshub looks interesting, thanks! Will check it out.