Why Science’s press team won’t be using AI to write releases anytime soon

English typewriter, Wikimedia Commons

I used to work in the Science news office at the American Association for the Advancement of Science (AAAS) in Washington, DC. As anyone who’s been there will tell you, there’s something special about the AAAS building — the dark polished rock exterior, the spiral staircase that, as you climb from floor to floor, takes you past framed Science covers illustrating more than a century of scientific achievements.

What AAAS does carries weight. Which is why I was intrigued — very intrigued — to hear last year that the Science Press Package team had decided to run an experiment testing how well ChatGPT Plus could write its press releases.

I would probably have been alarmed by the news if I didn’t know the Science journals’ communications director, Meagan Phelan. I admire Meagan — we’ve been good friends since I worked in the DC office — and you will not find a more thoughtful, intellectually rigorous and caring person. If Meagan signed off on this experiment, I thought, it wasn’t just a reckless bid for efficiency, but a genuine attempt to understand LLMs better.

The year-long experiment began in 2023 and was published in September 2025. Curious how it went, I sent Meagan a message and she suggested I talk to Abigail Eisenstadt, one of the Science Press Package (SciPak) senior writers. Abigail led the study and wrote this white paper to describe its findings. (She also just did an interview about it on the Grammar Girl podcast.) Abigail let me ask her a bunch of questions and listened to my own rambling thoughts and fears about AI-powered science writing, and I came away feeling much better.

There’s a lot to be worried about these days — including that generative AI is busily burying us alive in crap. But I am grateful to patient souls like Abigail, who set out to explore in a measured and deliberate way what these tools can and cannot do when we ask them to perform nuanced human tasks. We need thoughtful folks doing these experiments, if only to keep delineating and articulating the precious — and still distinct — boundaries between AI algorithms and our own remarkable human minds.

Here’s what Abigail told me about Science’s ChatGPT Plus experiment, and why it left her feeling more secure in her job, not less.

Note: This conversation was edited for clarity and brevity at 11:45pm, by my sleepy human brain and typo-prone fingers. Cases in point: 1) Mortifyingly, an earlier draft of this misspelled Meagan’s name with an h. 2) An earlier draft of this also failed to distinguish between all of AAAS communications and the Science journals press team. This interview pertains specifically to the Science press team, not the whole organization.

Emily: I know you’ve described this in detail elsewhere, but let’s start with the basics. What were the nuts and bolts of this experiment?

Abigail: Understanding the gist of what we did relies on understanding how we operate our press packages. Every Monday, we start drafting press briefs — three to six of them, depending on the week and our workloads. We write up summaries from our journals, and different writers represent different journals. I do the open access journal, Science Advances, but we also have specialty writers on Science Immunology, Science Robotics and other sibling journals.

What we wanted to do for this study — and this was back in 2023, before much of the common knowledge about prompt engineering was around — was to see if ChatGPT Plus could use our outline structure, the template we rely on to write our summaries, to write research summaries too.

It wasn’t necessarily a yes or no question. It was: how much of it can it do? If it can’t, are there ways it could evolve in the future so that it could?

What we did — because we’re already creating content every week — was to take something that had already been published, so the embargo had already passed, and run it through ChatGPT Plus using three different prompts for summaries. We used the privacy mode option, so our material wasn’t being used for training.

One prompt was for a very general summary: Just tell me what the study is about. The second was more abstract, but in a way that you would hope would be accessible to journalists. The third was inspired by our own very specific press release template.

The LLM would produce these summaries, and I would send them back to the author who had written the human content and ask them to compare it to theirs. And it was very clear from the start of this project that there was no danger of us being supplanted.

Wouldn’t it have been fairer to mix LLM summaries with the human-written ones and have an independent judge do the comparison?

One reason that we didn’t double-blind it was that we’re all writing so many summaries every single week — this was already an additional step in our workload. I also wanted to have the people who were already intimately aware of the study evaluate it. If I’m looking at a summary that talks about results but doesn’t mention the sample size, or that the study was in mice, I won’t know that unless I read the paper — whereas somebody who’s already covered the study will know to look for that. So that was the trade-off we made, I guess you could say.

I would have the writers complete surveys and rank the summaries, from one to five, on the feasibility of their appearing in our press package and on how compelling they were. I also gathered qualitative feedback: If you don’t like it, can you tell me why? I think we had one study where the reviewer said something like, “I’m not sure this analyzed the correct study, because this is so off base.” But I thought that must be an outlier, so I decided not to lean on it too much as an example.

Where did the impetus for the experiment come from? Leadership or staff?

I work closely with Meagan Phelan and Matt Wright on this, and they came to me to ask if I had the bandwidth to run it, and I was happy to. But I think the impetus was that our team likes to be on top of emerging tools and topics. I know Meagan stays on top of everything happening in the scientific enterprise. I think curiosity is kind of a tenet of our team. That’s the short answer: We wanted to know, what were these tools that had just popped up? It was 2023 — was the hype true?

A more nuanced answer is that we work for a scientific institution, and we know science is about pursuing knowledge. If we can understand these tools better, whether we use them or not, and even if there are uses that we haven’t identified yet, we’ll be able to stay up to speed proactively and serve reporters, authors, and press officers better.

In the white paper you talk about the difference between transcription and translation, making the assertion that while ChatGPT Plus could transcribe research, translation was beyond it. What’s the difference?

Just to draw back a little, I went to grad school for science journalism, and I had a professor who really reinforced the idea that what science writing offers, or science communication offers, is an analysis for your audience.

Your audience is not reading the paper themselves. They’re not drawing their own insights. What you can offer them is your take on those insights. To me, transcription does not have a contextual or analytical fingerprint in the same way that translation does.

Translating requires you to master the original material, adapt it, and preserve its meaning. I was thinking about this yesterday, and I thought, every year there’s a new edition, a new translation, of the Odyssey. Every time, it gets a review in the New York Times or some other publication — and it’s not redundant, because it involves interpretation, rather than regurgitation.

 Every time a translator takes a book and puts it in their own words, they are interpreting the material slightly differently. What we found was that ChatGPT Plus couldn’t do that. It could regurgitate or transcribe, but it couldn’t achieve the nuance to count as its own interpretation of a study.

I think that’s because ChatGPT Plus isn’t in society — it doesn’t interact with the world. It’s predictive, but it’s not distilling or conceptualizing what matters most to a human audience, or the value that we place in narratives that are ingrained in our society. An example of this is that we challenged ChatGPT Plus to do a summary of two studies at once. That, to me, requires a lot of translation, because you’re taking two studies, you’re pulling a theme from both, and presenting it in a digestible narrative. ChatGPT Plus could only focus on one.

I feel like what you’re getting at is that there’s something inextricably human about understanding and making sense of a scientific study.

Yeah, that’s a beautiful way to put it. I think when you read a study, you’re always looking to see, how can I find myself in it? We look for ourselves to some degree — even on a biomolecular level. For example, it’s very interesting, if you think about it, to learn about the immune system if you’ve had pneumonia that was antibiotic resistant. Maybe it’s a question that we’ve always sat on, or, you know, a fascination we’ve always had with a dinosaur.

I would argue, for a lot of research, there’s usually a spark. When people read what we write, it’s our job to share what we found in it that was interesting to us. Obviously, you need to collaborate when you pick newsworthy studies — otherwise, you’re just going to throw in a bunch of molecular studies about the gut microbiome because you’re obsessed with psyllium husks — again, my own issue.

But I think the throughline is just your own interest and how you can share it. We like to be engaged with each other when we talk, and it’s the same thing when we write.

Let’s talk about trust. The press team at Science is an important conduit between scientific research and the public — were you worried about losing or breaking that chain of trust by involving AI?

So, first, we did this project with no intent of incorporating it into our workflow. We did it because it’s important to stay informed and curious, and we were very concerned about how LLMs might impact our reputation. I can’t stress that enough.

We also know that science reporters are skeptical — rightfully so — about AI-assisted press releases. We monitor metrics for the studies we publish, and you can tell when a syndication site uses an AI-generated title. It’s very obvious that it’s slightly off base for the study. So we were skeptical.

Now, after this experiment, we’re very against using it. After a year of data, we know it can’t meet our standards. If we ever did plan to use it, we’d have to implement super rigorous fact-checking, because we don’t want to lose reporters’ trust.

I think Matt and Meagan have done an incredible job of instilling in our team that we are standing upon the shoulders of our peers, and that we are part of a long process of building trust that involves the authors, the peer reviewers, the editors. God forbid we mess that up. But it’s also our job to take that process one step further — to provide a service that a lot of researchers aren’t trained in.

The only way we can do that is to get reporters to understand that we deeply believe in accuracy. It really gets my goat when I see a study that I wrote about misrepresented. Sometimes we’ll pick studies that have news potential, but that we also worry will be misrepresented, and we’ll decide that they can benefit from a summary simply to explain the potential for misrepresentation.

So yeah, we are skeptical. We would never want to lose the trust that we have built.

It’s not nothing to make some of this work a little easier. How do you feel about that? Should we try to stay open to the possibility that some of these AI tools might be helpful?

I tend to be very risk averse, just in my life, so the idea of making a mistake and not catching it while using an LLM seems to me like far more work to correct. I mean, it takes so long to repair a broken relationship with a reporter, or to walk back misinformation that you put out. The idea of a mistake is so concerning that it does not seem like an easier workload for me.

However, I have heard — and I do think this is interesting — of writers taking their own summaries and putting them into the LLM and asking, what is the main message of this piece? I think that would be interesting — I can see asking things like, if you read this as a high schooler, what is the main point of this summary? It could be interesting to experiment with that. But again, I prefer my human editor Matt, because I learn new stuff from him every week. I recently overcame an intense dislike of the semicolon after understanding its use. ChatGPT Plus can never give me that. I do appreciate that there are uses for it. I would never use it for creativity — as a source of creativity — maybe more so as a project management or summation tool.

Do you do creative work outside of writing at AAAS?

No. My teammates tell me that baking is creative, and I do bake a lot, but no, I’m not a creative writer. I think nonfiction writing is the limit of my creativity — but perhaps that’s just because I am so tired at the end of every day.

Are there any summaries from this experiment that popped out at you, either as impressive or as a disaster? You mentioned that you were trying to be fair by treating the very bad samples as outliers, and I appreciate that.

I do have a good and a bad example. The most interesting example is also the one that impressed me: I found that ChatGPT Plus was very good at summarizing reviews that we publish, and perspectives. I was thinking, why is that? And then I realized it’s transcribing material that has already been translated by a human. Often reviews are interpreting science — a perspective article is taking a look at a big field, and so you’re naturally incorporating translation into those summaries.

I don’t think that’s necessarily devaluing the quality of the AI summaries — perhaps if you’re working with material that already has been through translation, that’s fine. Maybe it fills that purpose. Again, though, I probably would still want to do so much fact-checking on them that I don’t know if there would be a return on investment.  

And I do have a negative example. I don’t think it’s just a one-off negative, because it did come up a few times. And that is, ChatGPT Plus is not great at understanding correlation versus causation. I have an example of this in the white paper, and it’s the one I’m going to bring up: It was a Science Advances study on BMI in childhood and adulthood, and psychiatric disorders.

The paper involved a correlational finding between higher BMI in childhood and schizophrenia risk that went away in adulthood. It also found some insights about OCD, but again, it was very correlational. One of the titles that ChatGPT Plus provided was “Childhood obesity increases schizophrenia risk, new study finds.” And I was like — “Oh, no.” That was extremely concerning to me. If I was editing that summary without reading the paper, would I be able to spot that? I’d have to read the entire study and go to the methods, which doesn’t simplify my workload at all. That makes it worse.

I would be ashamed to share that summary with a family member, because then they would take that on and spread it. And I cannot imagine sharing that with a reporter. That’s like a rookie J-school 101 mistake. So that was a shock to the system.

How do you see LLMs differently than you would have if you hadn’t done this experiment?

When I started this experiment, I was not too fazed by ChatGPT Plus. I kind of saw AI as another Industrial Revolution — like, humans will need new skills to interact with this tool. I still feel the same way, but I also think that there needs to be a very methodical approach when we’re thinking about the skills we want to develop.

This project made me realize that we’re going to have to collectively and intentionally decide how we’re going to use, like, this generation’s printing press. I think it can’t be an organic evolution, because we have to be cautious. Trust is the heart of science communication, so we have to work within our community and remember who our audiences are, and the commitment we have to our readers as we test the boundaries of these tools.

We can’t just get caught up in how quickly they can produce content. We need to engage in discussion before we take these steps. That’s not to say people shouldn’t be experimenting, but I think there should be transparency in experimentation so that we can all discuss the results and think it through.

You’ve suggested that you — or at least someone at AAAS — will be redoing the experiment at some point. Do you have a plan for that yet?

We do not have a plan — I do not. I do hope somebody on my team takes it over. I think we wouldn’t redo the exact experiment, because I don’t know if any of us realized in 2023 how quickly things would change — how fast new models would be rolling out. I think there is more merit to not running it over a whole year again, because we’ve learned how fast this moves, but maybe doing something more intense, within a month or so. But no plans as of yet.

You have plenty of other work to do.

Yes. And I would like to pivot. Everyone’s aware of that.

This did not spark a love affair with AI?

No. For me, it was kind of like, OK, this is just a tool that we’re going to have to learn more about and decide if we want to use, if it makes things easier for us. But it’s not revolutionary.

Also, I feel very secure in my job now. I did before, but now I feel like, there’s no way you can get rid of me like this.

