AI from an Open Perspective

The way AI burst onto the scene at the beginning of 2023 captured my attention as it did everyone else’s. I’ve written a longish blog post called AI From an Open Perspective that follows my deep dive exploration of AI through the lens of open.

I talk about Source Data used to train AI including the origins of that data, legality of use, and its implications. I talk about open licensing including Creative Commons, Open Source Software licenses and the new Responsible AI Licenses (RAIL). I created a diagram of the AI Tech stack and describe the ways open plays out across different layers of the stack. I talk about AI Ethics and open. I talk about AI Regulation and open. I talk about the pedagogical ways AI is being trained. And finally I generate a visual depiction of the AI Ecosystem and provide specific recommendations for ensuring open is a vibrant part of that ecosystem.

Ambitious I know, but AI is complex, open’s role in it is evolving, and my interest is in understanding the big picture. I learned a huge amount in writing this post and invite you to join me in understanding AI From an Open Perspective.

Discussion, comments and suggestions welcome in reply.


I look forward to reading, Paul!

I’m especially glad you noted David Wiley’s idea “What if… we only wrote structured collections of highly crafted prompts? Instead of reading a static textbook in a linear fashion, the learner would use the prompts to interact with a large language model.”

That’s a brilliant suggestion.

From my perspective, AI is going to have its most important role in education as a translation tool. In India and Africa alone there are over a hundred languages with more than a million speakers each, which means more than 50 million students are going to get significant boosts in their learning when AI learns their mother tongue. Literature will be enhanced by that, not hindered.

Dan, I too think the idea of structured prompt collections is intriguing.

I’m not sure I see it as a replacement for linear reading, though I can see how it would be useful as a support for guided exploration and discovery. I also think a learner could be asked to generate their own structured collections of prompts along with short summaries of what gets generated, analysis of that output, and decision making on where to go next.

Lots of pedagogical possibilities.

There certainly is a huge need for more diverse language representation in online digital content, of all kinds, including literature. I will be watching with great interest AI’s use for translation and the extent to which it emerges as a means of bolstering education for learners all around the world.

Thanks for this detailed and encyclopedic piece of (human) writing, Paul. There’s lots to take in. I wonder, in my earnest wish to generate discussion here, if you might be “open” to perhaps a series of regular conversations to dive deeper into the topics covered. Here come some rambling thoughts.

I admit, and am guessing that many feel this way, to an overwhelming sense of not intuitively understanding, or really feeling my mental models stretch around, what “training” really means or all the steps generation goes through.

What you wrote here encapsulates my fuzziness:

It’s crucially important to understand that there is a fundamental difference between reproducing content and generating content. Generative AI doesn’t reproduce content from source data; it generates new content.

The way it “generates” new content is really quite far from things we have experienced: it’s not copying per se, nor is it sampling… it’s really different.

It’s helpful to see your designation of where openness intersects the stacks and layers.

With what I feel is my own vaporous understanding, what seems to happen is that we are left to draw conclusions from our own experiences: what the generators spit out.

If we can do anything, it’s to engage and share and question together. My brain feels stretched and tired trying to get my own understanding in order. I share Dan’s sense that uses in translation seem a positive area (and in my use of tools for transcribing audio I have felt reassured). I also echo the inferences in David Wiley’s “open” question about what an OER might mean: not solely a fixed entity of content, but something more… well, I don’t want to say “alive”.

Also, you set up an interesting concept that challenges what many of us started out with, where your

basic premise is that open is not synonymous with good.

and that all of this is pushing new conceptions of what openness might be evolving into.

I ask back, is it possible to break down everything you so carefully researched into some smaller questions we can chew on?

Lastly, I share my new favorite example of a small-ish LLM that offers a refreshingly transparent explanation of what it does. Much of the technical detail is beyond me, but at least it is there to see.

Thanks again, Paul, for this significant work of both research and analysis.


Alan, thanks for reading my encyclopedic post!
So much to unpack and I confess to still having so much to learn.
Despite the length I feel like I just scratched the surface.

I’d be delighted to participate in a series of regular conversations to dive deeper into the topics covered.

I too have an interest in exploring what training really entails. As educators this ought to be an area we have a vested interest in. I’m keen to engage, and share, and question together.

You ask “is it possible to break down everything you so carefully researched into some smaller questions we can chew on?” I’ll try. Here are some of the questions I asked myself in exploring this work.

What is the role of “open” in AI?

Source Data
What underlying source data is being used by AI as the basis for responding to human queries and prompts?
How much source data is being used? To what extent does more data lead to higher quality outputs?
Is the underlying source data being used as is or cleaned? Is the underlying source data free of bias and error?
Is it legal to train AI on data scraped from the web?
Is it morally and ethically fair to train AI on the work of others?

Open Licenses
What open licenses are used in AI?
Are open source software licenses and Creative Commons licenses fit for purpose?
Do open licenses need to go beyond IP & copyright?
Do open licenses need to address moral and ethical downstream issues?

AI Tech Stack
What are the technology layers that make up AI?
How are AI technologies associated with each layer made available for use?
What open licenses are used at each layer of the AI tech stack?

AI Models
What are AI models and why are they so important?
What are the different types of AI models? What is a Foundational model?
Why are models openly licensed? How are models openly licensed?
How are models trained?
What is the potential for uniquely custom models I train myself?

AI Training
What are AI machine learning, deep learning and neural networks?
What are the learning theories, models and practices being used to train AI?
How does AI use reinforcement learning, supervised learning and self-supervised learning?

AI Ethics
How do we ensure AI does not endanger safety and security?
Who is liable if AI generated output causes human harm?
How do we embed ethics into the AI developer and user community?
How do we ensure AI generates public good?
How can we avoid the appropriation of end user data for massive corporate gain?

AI Regulation
What are the aspects of AI that could go wrong and how severe are the ramifications?
Should AI be a licensed, regulated industry?
How can regulation be done in a way that is internationally compatible?
What are the criteria that trigger regulation?
To what extent can the AI community self govern?

Geez, I should stop here. But as you can see there are lots of questions. I also note that I by no means answered all these questions. I welcome hearing what other questions you or others have.

Thanks for providing this forum as a space for discussion.

Thanks Paul, looking forward to trying this out-- I know for sure I’d like to maybe start with the questions of source data and maybe licenses, but I’d also like to hear from others.

This reference to another brand new study gets at how complex it is to grasp what happens “inside the box”.

Addition: We created a poll to unpack which topics were of most interest, still open for your input.

Thank you for having initiated these discussion series around the promises and tribulations of AI.

Earlier this year I received a mini-grant to develop infrastructure and educate our faculty about this technology, which some of us see as akin to the disruptions we experienced with the Internet; some went even further and compared it to the discovery of electricity.

I have become an AI scholar interested in STEM education and academic research. I am currently finalizing 3 Moodle courses for faculty.

I agree with the idea of documenting use-cases and starting a repository.

I am very excited and ready to engage and sustain the community.

Related to your post’s discussion of licenses:

When I worked at Creative Commons (CC) I advocated for an addition to all Creative Commons licenses that would enable creators to express their intent. My belief was that expression of intent and downstream fulfillment of that intent would lead to more and better sharing.

Today’s post from Creative Commons picks this up as the idea of “preference signalling”.

Yes, I read that post from Catherine Stihler with great interest. I think to some extent the fine granularity expressed through preference signals may be overkill.

When I was working on writing Made With Creative Commons with Sarah Pearson, one of the observations we made was that most people saw use of CC licenses as an expression of an interest to share, and a recognition that in the digital world we live in this can be done in a way that leads to abundance. The legal distinctions between the various CC licenses were of less interest and often confusing.

I’d also say that most of those we spoke with around their use of CC licenses were primarily interested in contributing to a public commons and less interested in seeing their work used by big megacorps for profit.

I keep wondering why we don’t have a means (preference signals? licenses?) for explicitly making clear this desire to allow creative works to benefit a commons. Catherine references this in her post saying “If CC is endorsing restrictions in this way we must be clear that our preference is a “commons first” approach.”

I also note that preference signals are different from intention signals. Intention signals express why a creator is sharing a work and what they hope will be the outcome of that sharing. Preference signals express limitations around how creators see their work being used by others, in the context of Catherine’s post, AI.

For AI specifically I keep wondering if we might not benefit from an AI that is by and for the commons rather than just AI offered up by businesses and corporations. I for one would opt in to sharing data and creative works with a commons AI but not so much with AI that will be monetized for exclusive benefit of a megacorp.

LLMs are built upon a corpus of text. This benefits the dominant languages of the global north. There is no equivalent corpus of text among the 2,000 languages of Africa. Yes, there are some initiatives that I come across. But much like my friends in the global north, I too am a member of a dominant language group, here in South Africa, which means that I am generally unaware of them. This has to be factored into our conversations when we talk about LLMs.

Thanks for picking up on this, Derek.

I was thinking of a clever opening to greet you, and by instinct reached for a web search for a regionally appropriate expression. While I found many results, I will go with “Molo” (an isiXhosa greeting); it feels good to just say it. My question to ChatGPT returned a very chatty response, ultimately suggesting English (sigh: very factual, but not fun):

In South Africa, people speak a variety of languages, with 11 official languages recognized in the country. These languages include English, Afrikaans, isiZulu, isiXhosa, Sesotho, Setswana, Sepedi, isiNdebele, Xitsonga, Tshivenda, and SiSwati. Since you’re not sure of the recipient’s first language, it’s a good idea to use a greeting that is widely understood and accepted across South Africa. English is one of the most commonly spoken languages in the country and is widely used for communication, so “Hello” or “Hi” in English is a safe and appropriate choice.

You can simply start your message with “Hello” or “Hi,” which should be well-received by most South Africans regardless of their language preference. If you want to add a personal touch or show respect for their culture and language, you can also consider learning a basic greeting in one or more of the other official languages, but using English as a default is generally acceptable and polite.

Is this so? It seems so… bland.

Part of my own dilemma in this space is that I continue operating mostly from inference, results, and guesswork about how these systems conceptually work under the hood. And with these being closed systems with a sprinkling of “open” in brand names, do we really have influence to expand the training?

And yes, a first order thought, which for all I know might be workable, is to create an LLM, or run one, trained on known sources. But can it be as effective at the order of magnitude (and computing energy) of the big ones? A better corpus ought to produce better results; is this what has been seen in the initiatives you mention?

I have also read that a different approach is “engineering” the way the large systems work; the analogy (maybe wrong?) being that we do not necessarily try to build our own web search platforms, but try to slice the ones we have to serve more appropriate results.

I am eager to hear more from Dan’s work, as he has described positive results from using translation AI (though that training is likely not the same can of beans as LLMs) for producing content in languages such as isiXhosa.

Looking forward to learning more from Dan’s project and hearing first hand how it is being used this week: