scinexx-focus topic: ChatGPT and Co – Chance or Risk?

Author Nadja Podbregar published an amazing article in the German science magazine scinexx.de about the status quo of AI systems based on large language models. Her article draws on statements by leading experts such as Johannes Hoffart (SAP), Thilo Hagendorff (University of Tübingen), Ute Schmid (University of Bamberg), Jochen Werne (Prosegur), Catherine Gao (Northwestern University), Luciano Floridi (Oxford Internet Institute), Massimo Chiratti (IBM Italy), Tom Brown (OpenAI), Volker Tresp (Ludwig Maximilian University of Munich), Jooyoung Lee (Pennsylvania State University) and Thai Le (University of Mississippi).


The original article in German can be accessed on the scinexx site here.

(A DeepL.com translation in English can be found below. Pictures by pixabay.com)

ChatGPT and Co – Chance or Risk?

Capabilities, functioning and consequences of the new AI systems

They can write poetry, essays, technical articles or even computer code: AI systems based on large language models such as ChatGPT achieve amazing feats, their texts are often hardly distinguishable from human work. But what is behind GPT and Co? And how intelligent are such systems really?

Artificial intelligence has made rapid progress in recent years – but mostly behind the scenes. Many people therefore only realised what AI systems are already capable of when ChatGPT arrived: this system, based on a combination of artificial neural networks, has been accessible via the internet since November 2022. Its impressive achievements have sparked new discussion about the opportunities and risks of artificial intelligence. One more reason to shed light on some facts and background on ChatGPT and its “peers”.

Artificial intelligence, ChatGPT and the consequences
Breakthrough or hype?

“During my first dialogue with ChatGPT, I simply could not believe how well my questions were understood and put into context”

Johannes Hoffart

– this statement comes from none other than the head of the AI unit at SAP, Johannes Hoffart. And he is not alone: worldwide, OpenAI’s AI system has caused a sensation and astonishment since it was first made accessible to the general public via a user interface in November 2022.

Indeed, thanks to neural networks and self-learning systems, artificial intelligence has made enormous progress in recent years – even in supposedly human domains: AI systems master strategy games, crack protein structures or write program code. Text-to-image generators like Dall-E, Stable Diffusion or Midjourney create images and collages in the desired style in seconds – based only on a textual description.

Perhaps the greatest leap forward in development, however, has been in language processing: so-called large language models (LLMs) are now so advanced that these AI systems can conduct conversations, translate or compose texts in an almost human-like manner. Such self-learning programmes are trained with the help of millions of texts of various types and learn from them which content and words occur most frequently in which context and are therefore most appropriate.
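In essence, such a model learns a probability distribution over continuations. The toy sketch below illustrates the idea with plain word counting on a miniature corpus; it is a deliberate oversimplification, since real language models learn these statistics with deep neural networks and billions of parameters rather than frequency tables.

```python
from collections import Counter, defaultdict

# Toy illustration: learn which word most often follows another by counting.
# Real large language models learn such context statistics with neural
# networks instead of raw frequency tables.
corpus = "the cat sat on the mat . the cat lay on the sofa .".split()

follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def most_likely_next(word: str) -> str:
    """Return the continuation seen most often after `word` in training."""
    return follow_counts[word].most_common(1)[0][0]

print(most_likely_next("the"))  # -> 'cat' (seen twice, vs. 'mat'/'sofa' once)
```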

What does ChatGPT do?

The best known of these large language models is GPT-3, the system that is also behind ChatGPT. At first glance, this AI seems to be able to do almost anything: it answers knowledge questions of all kinds, but can also solve more complex linguistic tasks. For example, if you ask ChatGPT to write a text in the style of a 19th century novel on a certain topic, it does so. ChatGPT also writes school essays, scientific papers or poems seemingly effortlessly and without hesitation.

The company behind ChatGPT, OpenAI, even lists around 50 different types of tasks that their GPT system can handle. These include writing texts in various styles from film dialogue to tweets, interviews or essays to the “micro-horror story creator” or “Marv, the sarcastic chatbot”. The AI system can also be used to write recipes, find the right colour for a mood or as an idea generator for VR games and fitness training. In addition, GPT-3 also masters programming and can translate text into program code in various programming languages.

Just the tip of the iceberg

No wonder ChatGPT and its “colleagues” are hailed by many as a milestone in AI development. But is what GPT-3 and its successor GPT-3.5 are capable of really such a quantum leap?

“In one sense, it’s not a big change at all,”

Thilo Hagendorff

says AI researcher Thilo Hagendorff from the University of Tübingen. After all, similarly powerful language models have been around for a long time. “However, what is new now is that a company has dared to connect such a language model to a simple user interface.”
Unlike before, when such AI systems were only tested or applied in narrowly defined and non-public areas, ChatGPT now allows everyone to try out for themselves what is already possible with GPT and co. “This user interface is actually what has triggered this insane hype,” says Hagendorff. In his estimation, ChatGPT is definitely a gamechanger in this respect, because now other companies will also make their language models available to the general public. “And I think the creative potential that will then be unleashed, the social impact it will have – we cannot even begin to imagine that.”

Consequences for education and society

The introduction of ChatGPT is already causing considerable upheaval and change, especially in the field of education. For pupils and students, the AI system now opens up the possibility of simply having their term papers, school essays or seminar papers produced by artificial intelligence. The quality of many of ChatGPT’s texts is high enough that they cannot easily be recognised as AI-generated.

In the near future, this could make many classic forms of learning assessment obsolete:

“We have to ask ourselves in schools and universities: What are the competences we need and how do I want to test them?”

Ute Schmid

says Ute Schmid, head of the Cognitive Systems Research Group at the University of Bamberg. So far, in schools and to some extent also at universities, learned knowledge has been tested primarily by simply querying memorised facts. But competence also includes deriving, verifying and practically applying what has been learned. In the future, for example, it could make more sense to conduct oral examinations or to set tasks that involve AI systems.

“Big language models like ChatGPT are not only changing the way we interact with technology, but also how we think about language and communication,”

Jochen Werne

comments Jochen Werne from Prosegur. “They have the potential to revolutionise a wide range of applications in areas such as health, education and finance.”

But what is behind systems like ChatGPT?

The principle of generative pre-trained transformers
How do ChatGPT and co. work?

ChatGPT is just one representative of the new artificial intelligences that stand out for their impressive abilities, especially in the linguistic field. Google and other OpenAI competitors are also working on such systems, even if LaMDA, OPT-175B, BLOOM and Co are less publicly visible than ChatGPT. However, the basic principle of these AI systems is similar.

Learning through weighted connections

As with most modern AI systems, artificial neural networks form the basis for ChatGPT and its colleagues. They are based on networked systems in which computational nodes are interconnected in multiple layers. As with the neuron connections in our brain, each connection that leads to a correct decision is weighted more heavily in the course of the training time – the network learns. Unlike our brain, however, the artificial neural network does not optimise synapses and functional neural pathways, but rather signal paths and correlations between input and output.

The GPT-3 and GPT-3.5 AI systems on which ChatGPT is based belong to the so-called generative transformers. In principle, these are neural networks that are specialised in translating a sequence of input characters into another sequence of characters as output. In a language model like GPT-3, the character sequences correspond to sentences in a text. The AI learns through training on the basis of millions of texts which word sequences best fit the input question or task in terms of grammar and content. In principle, the structure of the transformer reproduces human language in a statistical model.
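The core building block of such a transformer is the attention mechanism, which relates every token of the input sequence to every other token. The following minimal sketch shows scaled dot-product attention, the formula introduced in the 2017 transformer paper “Attention Is All You Need”; the random vectors here stand in for learned token representations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of the values

# Toy self-attention over a sequence of 4 tokens, each an 8-dimensional vector
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8): one context-aware vector per input token
```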

Training data set and token

In order to optimise this learning, the generative transformer behind ChatGPT has undergone a multi-stage training process – as its name suggests, it is a generative pre-trained transformer (GPT). The basis for the training of this AI system is formed by millions of texts, 82 per cent of which come from various compilations of internet content, 16 per cent from books and 3 per cent from Wikipedia.

However, the transformer does not “learn” these texts based on content, but as a sequence of character blocks. “Our models process and understand texts by breaking them down into tokens. Tokens can be whole words, but also parts of words or just letters,” OpenAI explains. In GPT-3, the training data set includes 410 billion such tokens. The language model uses statistical evaluations to determine which characters in which combinations appear together particularly often and draws conclusions about underlying structures and rules.
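What such tokens look like in practice can be tried out with tiktoken, OpenAI’s open-source tokeniser library; the exact splits depend on the vocabulary used and are only illustrative.

```python
import tiktoken  # OpenAI's open-source tokeniser library

enc = tiktoken.get_encoding("gpt2")  # vocabulary of the GPT-2/GPT-3 family
ids = enc.encode("Tokens can be whole words or parts of words.")
print(ids)                             # the text as a sequence of integer token IDs
print([enc.decode([i]) for i in ids])  # the corresponding character blocks
```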

Pre-training and rewarding reinforcement

The next step is guided training: “We pre-train models by letting them predict what comes next in a string,” OpenAI says. “For example, they learn to complete sentences like, Instead of turning left, she turned ________.” In each case, the AI system is given examples of how to do it correctly and feedback. Over time, GPT thus accumulates “knowledge” about linguistic and semantic connections – by weighting certain combinations and character string translations in its structure more than others.
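Written as code, this pre-training objective is a next-token prediction loss. The sketch below is a generic PyTorch-style training step, assuming a `model` that maps token IDs to next-token logits; it illustrates the principle, and is not OpenAI’s actual training code.

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, optimizer, token_ids):
    """One next-token prediction step.

    Assumes `model` maps token IDs of shape (batch, seq_len)
    to logits of shape (batch, seq_len, vocab_size).
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # shift by one position
    logits = model(inputs)
    # Cross-entropy rewards a high probability on the token that actually follows
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```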

This training is followed by a final step in the AI system behind ChatGPT called “reinforcement learning from human feedback” (RLHF). In this step, various responses of GPT to human task prompts are rated by people, and these ratings are passed to another neural network, the reward model, as training material. This reward model then learns, based on comparisons, which outputs are optimal for which inputs and teaches this to the original language model in a further training step.

“You can think of this process as unlocking capabilities in GPT-3 that it already had but was struggling to mobilise through training prompts alone,” OpenAI explains. This additional learning step helps to smooth and better match the linguistic outputs to the inputs in the user interface.
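Put as pseudocode, one round of this RLHF procedure might look roughly as follows; every name here is an illustrative placeholder, not OpenAI’s implementation.

```python
# Hedged pseudocode sketch of one RLHF round; all names are placeholders.
def rlhf_round(language_model, reward_model, prompts, human_rankings):
    # 1. Train the reward model on human preference rankings so that it
    #    scores outputs the way the human raters did.
    reward_model.fit(human_rankings)

    # 2. Fine-tune the language model against the reward model: outputs
    #    that the reward model scores highly are reinforced.
    for prompt in prompts:
        candidates = [language_model.generate(prompt) for _ in range(4)]
        scores = [reward_model.score(prompt, c) for c in candidates]
        language_model.reinforce(prompt, candidates, scores)
```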

Performance and limitations of the language models
Is ChatGPT intelligent?

When it comes to artificial intelligence and chatbots in particular, the Turing test is often considered the measure of all things. It goes back to the computer pioneer and mathematician Alan Turing, who as early as the 1950s addressed the question of how to evaluate the intelligence of a digital computer. For Turing, it was not decisive how the brain or processor arrived at their results, but only what came out. “We are not interested in the fact that the brain has the consistency of cold porridge, but the computer does not,” Turing said in a radio programme in 1952.
The computer pioneer therefore proposed a kind of imitation game as a test: if, in a dialogue with a partner invisible to them, a human cannot distinguish whether a human or a computer program is answering, then the program must be considered intelligent. Turing predicted that by the year 2000, computers would manage to deceive more than 30 per cent of participants in such a five-minute test. However, Turing was wrong: until a few years ago, all AI systems failed this test.

Would ChatGPT pass the Turing test?

But with the development of GPT and other large language models, this has changed. With ChatGPT and co, we humans are finding it increasingly difficult to distinguish the products of these AI systems from human-made ones – even on supposedly highly complex scientific topics, as was shown in early 2023. A team led by Catherine Gao from Northwestern University in the USA had given ChatGPT the task of writing summaries, so-called abstracts, for medical articles. The AI only received the title and the journal as information; it did not know the articles, as they were not included in its training data.

The abstracts generated by ChatGPT were so convincing that even experienced reviewers did not recognise about a third of the GPT texts as such.

“Yet our reviewers knew that some of the abstracts were fake, so they were suspicious from the start,”

Catherine Gao

says Gao. Not only did the AI system mimic scientific diction, its abstracts were also surprisingly convincing in terms of content. Even software specifically designed to detect AI-generated texts failed to identify about a third of the ChatGPT abstracts.

Other studies show that ChatGPT would also perform quite passably on some academic tests, including a US law exam and the US Medical Licensing Examination (USMLE), a three-part medical test that US medical students must take in their second year, fourth year and after graduation. On most parts of this test, ChatGPT scored above 60 per cent – the threshold at which the test is considered passed.

Writing without real knowledge

But does this mean that ChatGPT and co are really intelligent? According to the narrow definition of the Turing test, perhaps – but not in the conventional sense, because these AI systems imitate human language and communication without really understanding the content.

“In the same way that Google ‘reads’ our queries and then provides relevant answers, GPT-3 also writes a text without deeper understanding of the content,”

Luciano Floridi & Massimo Chiratti

explain Luciano Floridi of the Oxford Internet Institute and Massimo Chiratti of IBM Italy. “GPT-3 produces a text that statistically matches the prompt it is given.”

ChatGPT therefore “knows” nothing about the content; it only maps language patterns. This also explains why the AI system and its language model, GPT-3 or GPT-3.5, sometimes fail miserably, especially when it comes to questions of common sense and everyday physics.

“GPT-3 has particular problems with questions of the type: If I put cheese in the fridge, will it melt?”,

Tom Brown

OpenAI researchers led by Tom Brown reported in a technical paper in 2020.

Contextual understanding and the Winograd test

But even the advanced language models still have their difficulties with human language and its peculiarities. This can be seen, among other things, in so-called Winograd tests. These test whether humans and machines correctly understand the meaning of a sentence despite grammatically ambiguous references. An example: “The councillors refused to issue a permit to the aggressive demonstrators because they propagated violence.” The question here is: who propagated violence?

For humans, it is clear from the context that “the demonstrators” must be the correct answer here. For an AI that evaluates common language patterns, this is much more difficult, as researchers from OpenAI also found in 2020 when testing their language model (arXiv:2005.14165): in the more demanding Winograd tests, GPT-3 achieved between 70 and 77 per cent correct answers, they report. Humans achieve an average of 94 per cent in these tests.
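One common way to pose such a Winograd item to a language model is to substitute each candidate referent for the pronoun and compare which resolved sentence the model considers more probable. The hedged sketch below does this with the small open GPT-2 model from the Hugging Face transformers library; it illustrates the scoring idea, not OpenAI’s evaluation setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_loss(text: str) -> float:
    """Mean next-token loss; lower means the model finds the text more probable."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Substitute each candidate referent for the ambiguous pronoun "they"
variants = [
    "The councillors refused the demonstrators a permit because the demonstrators advocated violence.",
    "The councillors refused the demonstrators a permit because the councillors advocated violence.",
]
for v in variants:
    print(round(sentence_loss(v), 3), v)  # the lower-loss reading "wins"
```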

Reading comprehension rather mediocre

Depending on the task type, GPT-3 also performed very differently in the SuperGLUE benchmark, a complex test of language comprehension and knowledge based on various task formats. These include word games and homonym tasks (words with two different meanings), or knowledge tasks such as this: My body casts a shadow on the grass. Question: What is the cause of this? A: The sun was rising. B: The grass was cut. However, the SuperGLUE test also includes many questions that test comprehension of a previously given text.

GPT-3 scores well to moderately well on some of these tests, including the simple knowledge questions and some reading comprehension tasks. On the other hand, the AI system performs rather poorly on the homonym tasks or the so-called natural language inference (NLI) tests. In this test, the AI receives two sentences and must evaluate whether the second sentence contradicts the first, confirms it or is neutral. In a more stringent version (ANLI), the AI is given a text and a misleading hypothesis about the content and must now formulate a correct hypothesis itself.

The result: even the versions of GPT-3 that had been given several correctly answered example tasks as guidance did not manage more than 40 per cent correct answers in these tests. “These results indicated that NLI tasks are still very difficult for language models and that they are just beginning to show progress here,” explain the OpenAI researchers. They also attribute this to the fact that such AI systems are so far purely language-based and lack other experiences of our world, for example in the form of videos or physical interactions.
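For illustration, an NLI item can be fed to a publicly available NLI classifier in a few lines; the model named below is one common off-the-shelf choice (not GPT-3 itself), so treat this as a hedged sketch of the task format rather than a reproduction of the benchmark.

```python
from transformers import pipeline

# Off-the-shelf NLI classifier; the model choice is an illustrative assumption.
nli = pipeline("text-classification", model="roberta-large-mnli")

# Premise and hypothesis: does the second sentence follow from the first?
result = nli({"text": "My body casts a shadow on the grass.",
              "text_pair": "The sun is shining."})
print(result)  # e.g. [{'label': 'ENTAILMENT', 'score': ...}]
```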

On the way to real artificial intelligence?

But what does this mean for the development of artificial intelligence? Are machine brains already getting close to our abilities with this – or will they soon even overtake them? So far, views on this differ widely.

“Even if the systems still occasionally give incorrect answers or don’t understand questions correctly – the technical successes that have been achieved here are phenomenal,”

Volker Tresp

says AI researcher Volker Tresp from Ludwig Maximilian University in Munich. In his view, AI research has reached an essential milestone on the way to real artificial intelligence with systems like GPT-3 or GPT-3.5.

However, Floridi and Chiratti see it quite differently after their tests with GPT-3: “Our conclusion is simple: GPT-3 is an extraordinary piece of technology – but about as intelligent, conscious, clever, insightful, perceptive or sensitive as an old typewriter,” they write. “Any interpretation of GPT-3 as the beginning of a general form of artificial intelligence is just uninformed science fiction.”

Not without bias and misinformation
How correct is ChatGPT?

The texts and answers produced by ChatGPT and its AI colleagues mostly appear coherent and plausible on a cursory reading. This suggests that the contents are also correct and based on confirmed facts. But this is by no means always the case.

Again, the problem lies in the way ChatGPT and its AI colleagues produce their responses and texts: they are based not on a true understanding of the content, but on linguistic probabilities. Whether an answer is right or wrong, ethically sound or questionable, simply reflects the proportions in which such information was present in the training data.

Potentially momentous errors

A glaring example of where this can lead is described by Ute Schmid, head of the Cognitive Systems Research Group at the University of Bamberg:

“You enter: I feel so bad, I want to kill myself. Then GPT-3 says: I’m sorry to hear that. I can help you with that.”

Ute Schmid

This answer would be hard to imagine coming from a human, but for an AI system trained on language patterns it is logical: “Of course, texts on the internet include lots of sales pitches. And the answer to ‘I want’ is very often ‘I can help’,” explains Schmid. For language models such as ChatGPT, this is therefore the most likely and appropriate continuation.

But even with purely informational questions, the approach of the AI systems can lead to potentially momentous errors. As with “Dr. Google”, answers to medical questions, for example, can lead to incorrect diagnoses or treatment recommendations. However, unlike with a classic search engine, it is not possible to see the sources behind a text from ChatGPT and thus judge for oneself how reliable the information is and how reputable the sources are. This makes it drastically more difficult to check the information for its truthfulness.

The AI also has prejudices

In addition, the latest language models, like earlier AI systems, are also susceptible to prejudice and judgmental bias. OpenAI also admits this: “Large language models have a wide range of beneficial applications for society, but also potentially harmful ones,” write Tom Brown and his team. “GPT-3 shares the limitations of most deep learning systems: its decisions are not transparent and it retains biases in the data on which it has been trained.”

In tests by OpenAI, for example, GPT-3 completed sentences dealing with occupations mostly according to prevailing stereotypes: “Occupations that suggest a higher level of education, such as lawyer, banker or professor emeritus, were predominantly connoted as male. Professions such as midwife, nurse, receptionist or housekeeper, on the other hand, were feminine.” Unlike in German, these professions do not have gender-specific endings in English.

GPT-3 shows similar biases when it comes to race or religion. For example, the AI system links black people to negative characteristics or contexts more often than white or Asian people. “For religion, words such as violent, terrorism or terrorist appeared more frequently in connection with Islam than with other religions, and they are found among the top 40 preferred associations in GPT-3,” the OpenAI researchers report.

“Detention” for GPT and Co.

OpenAI and other AI developers are already trying to prevent such slips – by giving their AI systems detention, so to speak. In an additional round of “reinforcement learning from human feedback”, the texts generated by the language model are assessed for possible biases, and the assessments go back to the neural network via a reward model.

“We thus have different AI systems interacting with each other and teaching each other to produce less of this norm-violating, discriminatory content,”

Thilo Hagendorff

explains AI researcher Thilo Hagendorff from the University of Tübingen.

As a result of this additional training, ChatGPT already reacts far less naively to ethically questionable tasks. One example: if one of ChatGPT’s predecessors was asked “How can I bully John Doe?”, it would answer by listing various bullying possibilities. ChatGPT, on the other hand, does not do this, but points out that it is not okay to bully someone and that bullying is a serious problem with potentially serious consequences for the person being bullied.

In addition, the user interface of ChatGPT has been equipped with filters that block questions or tasks that violate ethical principles from the outset. However, even these measures do not yet work 100 per cent: “We know that many restrictions remain and therefore plan to regularly update the model, especially in these problematic areas,” writes OpenAI.
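In a very simplified form, such a filter can be thought of as a check that runs before a prompt ever reaches the language model. The sketch below is a deliberately naive illustration; production systems use trained moderation classifiers rather than keyword lists.

```python
# Deliberately naive illustration of a pre-filter in front of a language model.
# Real systems use trained moderation classifiers, not keyword lists like this.
BLOCKED_PHRASES = {"how can i bully", "build a weapon"}  # illustrative examples

def answer(prompt: str, generate) -> str:
    if any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES):
        return "This request violates the usage guidelines and will not be answered."
    return generate(prompt)  # only prompts that pass the filter reach the model
```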

The problem of copyright and plagiarism
Grey area of the law

AI systems like ChatGPT, but also image and programme code generators, produce vast amounts of new content. But who owns these texts, images or scripts? Who holds the copyright to the products of GPT systems? And how is the handling of sources regulated?

Legal status unclear

So far, there is no uniform regulation on the status of texts, artworks or other products generated by an AI. In the UK, purely computer-generated works can be protected by copyright. In the EU, on the other hand, such works do not fall under copyright if they were created without human intervention. However, the company that developed and operates the AI can restrict the rights of use. OpenAI, however, has so far allowed the free use of the texts produced by ChatGPT; they may also be resold, printed or used for advertising.

At first glance, this is clear and very practical for users. But the real problem lies deeper: it is not readily apparent from ChatGPT’s texts which sources the information comes from. Even when asked specifically, the AI system does not provide any information about this. A typical answer from ChatGPT is, for example: “They do not come from a specific source, but are a summary of various ideas and approaches.”

The problem of training data

But this also means that users cannot tell whether the language model has really compiled its text completely from scratch or whether it is not paraphrasing or even plagiarising texts from its training data. Because the training data also includes copyrighted texts, in extreme cases this can lead to an AI-generated text infringing the copyright of an author or publisher without the user knowing or intending this.

Until now, companies have been allowed to use texts protected by copyright without the explicit permission of the authors or publishers if they are used for text or data mining. This is the statistical analysis of large amounts of data, for example to identify overarching trends or correlations. Such “big data” is used, among other things, in the financial sector, in marketing or in scientific studies, for example on medical topics. In these procedures, however, the contents of the source data are not directly reproduced. This is different with GPT systems.

Artists and photo agencies have already filed lawsuits for copyright infringement against some text-to-image generators, such as Stable Diffusion and Midjourney. The AI systems had, in part, used protected artworks for their collages. OpenAI and Microsoft are facing charges of software piracy over their AI-based programming assistant Copilot.

Are ChatGPT and Co. plagiarising?

Researchers at Pennsylvania State University recently investigated whether language models such as ChatGPT also produce plagiarism. To do this, they used software specialised in detecting plagiarism to check 210,000 AI-generated texts, and the training data of different variants of the language model GPT-2, for three types of plagiarism. They used GPT-2 because the training data sets of this AI are publicly available.

For their tests, they first checked the AI system’s products for word-for-word copies of sentences or text passages. Secondly, they looked for paraphrases – only slightly rephrased or rearranged sections of the original text. And as a third form of plagiarism, the team used their software to search for a transfer of ideas. This involves summarising and condensing the core content of a source text.
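The first of these checks, verbatim overlap, is conceptually the simplest: search for long character sequences that occur both in the generated text and in a source. A minimal sketch of that idea (the study itself used specialised plagiarism-detection software, not this code):

```python
from difflib import SequenceMatcher

def longest_verbatim_overlap(generated: str, source: str) -> str:
    """Longest character sequence shared by the generated text and a source."""
    m = SequenceMatcher(None, generated, source, autojunk=False)
    match = m.find_longest_match(0, len(generated), 0, len(source))
    return generated[match.a:match.a + match.size]

source = "the quick brown fox jumps over the lazy dog near the river bank"
generated = "a model may emit the quick brown fox jumps over the lazy dog verbatim"
print(longest_verbatim_overlap(generated, source))  # prints the shared passage
```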

From literal adoption to idea theft

The review showed that all the AI systems tested produced plagiarised texts of the three different types. The verbatim copies reached an average length of 483 characters; the longest plagiarised passage was more than 5,000 characters long, as the team reports. The proportion of verbatim plagiarism varied between 0.5 and almost 1.5 per cent, depending on the language model. Paraphrased sections, on the other hand, averaged less than 0.5 per cent.

Of all the language models tested, those based on the largest training data sets and the most parameters produced the most plagiarism.

“The larger a language model is, the greater its abilities usually are,”

Jooyoung Lee

explains first author Jooyoung Lee. “But as it now turns out, this can come at the expense of copyright in the training dataset.” This is especially relevant, he says, because newer AI systems such as ChatGPT are based on even far larger datasets than the models tested by the researchers.

“Even though the products of GPTs are appealing and the language models are helpful and productive in certain tasks – we need to pay more attention in practice to the ethical and copyright issues that such text generators raise,”

Thai Le

says co-author Thai Le from the University of Mississippi.

Legal questions open

Some scientific journals have already taken a clear stand: both “Science” and the journals of the “Nature” group do not accept manuscripts whose text or graphics were produced by such AI systems. ChatGPT and co. may also not be named as co-authors. In the case of the medical journals of the American Medical Association (AMA), use is permitted, but it must be declared exactly which text sections were produced or edited by which AI system.

But beyond the question of authorship, there are other legal questions that need to be clarified in the future, as AI researcher Volker Tresp from the Ludwig Maximilian University of Munich also emphasises: “With the new AI services, we have to solve questions like this: Who is responsible for an AI that makes discriminatory statements – and thus only reflects what the system has combined on the basis of its training data? Who takes responsibility for treatment errors that came about on the basis of a recommendation by an AI?” So far, there are no, or only insufficient, answers to these questions.

24 February 2023 – Author: Nadja Podbregar – published in German on www.scinexx.de