ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.
The paper suggests it was because of cost. It mainly focused on open models with public datasets as its basis, then tried the same approach on GPT-3.5. The authors note that they didn’t generate the full 1B tokens with GPT-3.5 because it would have been too expensive. I assume they didn’t test other proprietary models for the same reason. For Claude’s cheapest model, generating 1B tokens would cost over $5,000, and Bard API access isn’t widely available yet.
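For anyone who wants to redo the back-of-the-envelope math, here is a rough sketch. The function name and the $5 per million tokens rate are my own illustrative assumptions, not a quoted price from any provider; plug in the current rate for whichever model you care about.

```python
def estimate_cost_usd(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate the API cost of generating num_tokens output tokens."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical rate, chosen only to illustrate the order of magnitude.
ASSUMED_USD_PER_MILLION = 5.0

# 1B generated tokens, the amount the comment above refers to.
cost = estimate_cost_usd(1_000_000_000, ASSUMED_USD_PER_MILLION)
print(f"~${cost:,.0f} for 1B tokens at ${ASSUMED_USD_PER_MILLION}/M tokens")
```

This only counts generated tokens. Prompt tokens are billed separately on most APIs, so the real figure would be somewhat higher.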