Thoughts on U.S. vs Chinese models

jan · January 27, 2025, 3:58pm

@IslaHF and I had a backchannel conversation about DeepSeek:

DeepSeek-V3 is a super nice open-ish chat model and DeepSeek-R1 is a really good reasoning model (the R1 repository and weights are openly licensed, but the model is not reproducible because pieces of the pipeline are missing, although efforts are underway to address that).

There are also free-as-in-beer (but not open) options, e.g. ChatGPT, Claude and Google's Gemini (the Flash models are free of charge and super nice, too).

DeepSeek is, ironically, much more open than "Open"AI in areas like licensing and releasing the underlying research. However, by using DeepSeek through its chat interface or official API, people are feeding data into China, which could have geopolitical implications. Also, it refuses to talk about Taiwan’s independence or other topics sensitive to China, although the refusal rate is reportedly lower when the model is run locally. So there’s censorship as well, which is different from Western censorship.

What are your thoughts on using models like this? What are your considerations when choosing a chatbot or an API?

Super curious to hear people’s thoughts, including @moodler, who’s been doing a lot of thinking about AI recently.

cogdog · February 1, 2025, 6:06pm

There sure was a mad stampede to try DeepSeek. I was in a somewhat related, tangled Mastodon thread with @dajbelshaw @Downes @poritz, and I remain fuzzy on just what open means. Can you help?

So open model weights? What does a person do with that?

And where is the open (Sam I Not Am) when the training data is not?

And when they say the training data is open, what does that mean? If I download it, what do I do with 80 petabytes of vectors?

And when I find LLMs indicated as open in Hugging Face, why is it so convoluted to navigate the link labyrinth to find the data?

I keep seeing reference to open data sets with cryptic names, but has anyone actually ever seen the data?

For example:

Start here: Open-Sourced Training Datasets for Large Language Models (LLMs)

I try to get info on BookCorpus.

A few clicks get me to Hugging Face.

The link for “original dataset” is a citation link that is… a Wix site?

https://yknzhu.wixsite.com/mbweb

sending me to https://www.smashwords.com/

Where now I am lost. I see no corpus.

Us humans want a clearer understanding of what’s inside these things, not the technobabble about how they work (well, speaking for myself).

Doug did point out a fabulous example of open image data that I have shared before: PD12M

which is on its own a great resource. But how do we learn, from the front face of an LLM-powered site/tool, more about what it was fed on?

Otherwise, these things continue to be Kubrick-esque black stone monoliths.