We know content should be valuable, comprehensive, new, relevant, and accurate. Even more fundamental than all of those things, it needs to be authentic.
People trust factual information presented with sincere intentions. This era of “fake news” has ushered in a lot of fear for this very reason; we have had to fight to build the authority of our pages and domains to signal that we are worthy of trust.
But our industry has yet to face its biggest challenge.
What We Are Up Against
Question: Which of these writing samples was 100% written by a computer?
Tricked you! They were both written by a new language model derived from machine learning.
In February of this year, OpenAI released a paper and examples of a new unsupervised machine learning language model called GPT2. The model quickly set the machine learning community alight by breaking “state-of-the-art” records across many areas of natural language processing. According to OpenAI, GPT2:
“generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.”
Given its capabilities, OpenAI decided not to release the full model, providing a partial, limited model instead. OpenAI stated publicly that it limited the release out of concern that the model could be leveraged for nefarious purposes in many ways.
What is perhaps most interesting is that GPT2 did not make any significant algorithmic advances; instead, it performs so well because it was trained on massive sets of data. This makes it possible (and likely) that some have already been able to build the full GPT2 model (or something similar) by feeding the limited release model more data.
GPT2 and the SEO Apocalypse
Those who have been in SEO for a while probably remember the days when “article spinning” was all the rage. Article spinning was essentially “rewriting” existing content. This could be done programmatically (finding and replacing words in a Mad Lib style) just enough to trick Google into seeing the content as unique, or it could be done by hiring low-cost writers to simply paraphrase existing content (an approach that made massive quantities harder to produce).
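To make the find/replace mechanism concrete, here is a minimal sketch of Mad-Lib-style spinning. It is an illustration only: the `SYNONYMS` table and the `spin` function are hypothetical names invented for this example, not taken from any real spinning tool, and actual tools presumably used far larger word lists.

```python
import random

# Hypothetical synonym table, invented for this illustration.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "guide": ["tutorial", "walkthrough"],
    "improve": ["boost", "increase"],
}

def spin(text, rng):
    """Swap every word found in SYNONYMS for a randomly chosen alternative."""
    spun_words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        # Words without an entry in the table pass through unchanged.
        spun_words.append(rng.choice(options) if options else word)
    return " ".join(spun_words)

original = "A quick guide to improve your rankings"
spun = spin(original, random.Random(0))
print(spun)
```

Note that the sentence structure and every word outside the table survive untouched, which is why this kind of output stays so close to its source.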
The practice of article spinning began a trend of bulk, low-quality content creation as a method for dominating long-tail search, enabling opportunistic websites to create huge volumes of content to reap disproportionate traffic rewards. Google’s answer to article spinning and the growing proliferation of low-quality content came with the Panda and Penguin updates along with future algorithmic updates, putting increasing focus on content quality and the elimination of thin or programmatically generated low-value content from ranking.
GPT2, and future iterations of it, promise to revive article spinning and perhaps even usher in a new revolution of content spam, the likes of which have never been seen online. Because of its programmatic nature, scaling up content creation to millions of articles per day could become easy to execute, even for an individual.
Imagine once this tech becomes enabled for non-technical SEOs and non-programmers. It will make today’s article spinning programs look like child’s play and allow for a massive tsunami of computer-generated content across every niche imaginable—content that very likely can defeat Google’s current filters on low-quality content.
A Content Arms Race
With article spinning, it was much easier for Google to identify find/replace, Mad-Lib-style spinning than content that had been entirely rewritten by real humans. Because Mad-Lib-style spinning could be done programmatically, it accounted for a much larger share of content spam. Fortunately, that meant the most common form of content spam was also the easiest for Google to identify and discount.
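One way to see why find/replace spinning was comparatively easy to catch is to measure how much word-level overlap survives the spin. The sketch below uses Jaccard similarity over word bigrams, a standard near-duplicate measure. It is an assumption chosen for illustration, not a description of how Google actually detects spun content, and the sample sentences are made up.

```python
def shingles(text, k=2):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=2):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Made-up sample texts for the illustration.
original = "the quick brown fox jumps over the lazy dog near the quiet river bank"
# Find/replace spin: two words swapped, everything else untouched.
spun = "the fast brown fox jumps over the lazy dog near the calm river bank"
# Human-style rewrite: same idea, entirely different wording.
rewritten = "a speedy fox leaps across a sleepy hound beside a calm stream"

print(f"spun vs. original:      {jaccard(original, spun):.2f}")
print(f"rewritten vs. original: {jaccard(original, rewritten):.2f}")
```

The spun version still shares most of its bigrams with the source, while the full rewrite shares none, so even a simple overlap threshold separates the two cases. Text generated from scratch by a language model looks like the rewrite here, which is where this kind of check stops helping.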
What makes GPT2 so potentially dangerous for search and SEO is that it is not clear that GPT2-level content can be easily identified, especially for shorter lengths of content. While there have been some early indications that it could be possible to find such content with counter-algorithms trained to identify computer-generated text, what we are seeing is essentially a new computer-generated-content arms race.
As Google figures out how to identify content generated by GPT2, new models trained on new datasets will be churned out, ready to produce content that will get through previously-implemented filters. It seems unlikely that Google will be able to fully solve the problem of high-quality, algorithmically-created content in the same way they were able to defeat the first generation of article spinning content spam.
The Implications For Video Content
With 10% of all SERPs including video results and ~1% including featured video results, the incentives for SEOs to create vast quantities of long-tail and niche video content that can rank are greater than ever. Up until recently, the “bottleneck” has been the time and cost commitments required to produce these large volumes of content.
In the last few years, however, we have begun to see algorithmically generated video spam proliferate, polluting YouTube itself as well as SERPs. In 2017, HackerNoon wrote about this practice. Two years later, two of the three spam accounts mentioned in the article still exist, churning out incredibly low-quality content that generates hundreds of thousands of views per month and, according to some estimates, earns perhaps thousands of dollars per month for each account.
Examples like the one above were created using relatively simplistic tools, and the entire process can be automated very simply. Thus far, it seems Google doesn't have any great incentive to cleanse their platform of these videos, given they still allow ads to run on many of them (and profit from these ads).
Given Google's lack of action on this simplistically created, auto-generated video spam, it seems doubtful they will take an aggressive approach to this growing issue as transformative NLP models like GPT2 start being applied to the automated video creation process. Imagine a world where the process above for automatically creating content is supplemented by the content spinning prowess of GPT2 and future iterations of machine text generation. Thousands upon thousands of videos on any given topic could be created, all increasingly difficult to detect. It is easy to imagine that opportunists will soon realize this kind of content production pipeline is not hard to build, and that the “rewards” could be massive.
Other Nefarious Implications
Other areas of digital media and digital marketing will also be influenced by and even shaped by these new state-of-the-art language models. A few highlights of what could happen:
In the same way this new tech will be used to create long-tail content for SEO purposes, automated content creation will also be used to influence public opinion. Expect this technology to fuel the next generation of massive “fake news” proliferation across many mediums, including text, video, and audio.
GPT2 and other future language models can also be used as the underlying technology behind social media bots and chatbots. This time, though, they will be much more effective at fooling humans than previous incarnations. These models will most certainly be used to inject custom-written propaganda into many facets of our online lives. This will include chatbots that will post unique content to all social platforms, engage in conversations and comment on threads everywhere, and subversively sow seeds of divisiveness and discord as a way to influence conversations and public opinion.
Phishing and Email Spam
Expect email spam and phishing to become far more effective as custom-written text improves a spammer’s ability to circumvent spam filters. Additionally, GPT2 and other new language models will become increasingly capable of fooling real humans, making people more likely to fall for scams.
Astroturfing and Eroding of Reviews
As if astroturfing and fake reviews weren’t a big enough problem already, they promise to become significantly worse as GPT2 and other next-generation models become proficient at writing reviews that even humans find extremely difficult to distinguish from real ones. This will likely only add to the increasing erosion of trust around reviews, perhaps even rendering sites like Yelp useless.
Automated Outreach and Link Building
Many SEOs leverage outreach as a core component of link building. As this practice has become more common, influential writers, editors, and bloggers have become increasingly inundated with pitches. Many writers receive 100+ pitches per day, and the sheer volume crowds out the higher-quality pitches with real value. GPT2 and similar language models will find their way into the outreach process, perhaps enabling the scaling of “custom-written” outreach that isn’t truly custom written, causing the signal-to-noise ratio for quality PR work to worsen over time.
Existential Damage to Search, Social, and News
Perhaps the scariest implication of the next generation of NLP algorithms like GPT2 is their ability to chip away at public trust in an exponential way.
Because this content is computer generated, the scale at which it will be created will exponentially accelerate problems currently enabled by human-generated content like social media troll initiatives and fake news proliferation.
Mix this kind of technology with other burgeoning tech like deep fakes and voice mimicking AI, and you can imagine it might be possible to automatically generate huge volumes of fake text on whatever topic desired and have your favorite celebrity or politician “read” that text on video. Then you could use an army of computer-controlled bots spreading that content with their own automatically-created text and succeed at tricking many/most people.
For the SEO community and the world at large, these changes promise to be massively disruptive. To adapt successfully to these technological innovations, it is extremely important that we stay aware of the latest machine learning models and techniques. By learning about these new technologies and thinking about their implications, we can hopefully come up with more valuable ways to leverage them and begin to find ways to mitigate and prevent their abuse.
What We Should Do Now
One thing that many machine learning experts agree on is that we are likely a long way off from “artificial general intelligence,” the name given to AI that would be able to fully pass the Turing test and perform at the same level as humans across all measures. For now, humans still reign supreme in higher-order thinking tasks associated with abstraction and reasoning. It is these higher-order human cognitive functions that cannot be replicated currently or in the near future, and it is in these functions that the future of content lies.
To be clear, I didn’t write this post to alarm everyone about what is to come—I wrote it to stress the vital importance of building authenticity right now.
This has always been, and always will be, a crucial component of successfully reaching our target audiences as marketers. The more we can establish and strengthen it, the more likely we can successfully face any challenges ahead.
By focusing on how AI can assist us in finding and exposing new and hidden information in data, we can tell increasingly sophisticated stories that no machine learning algorithm would be capable of producing.
So, leverage AI in positive ways to establish trust with your audiences and build up your reputation. Soon, you might have to fight harder to maintain it.