July 27, 2023
By Jamal Robinson
AI / Machine Learning
Several billion-dollar tech companies lose money every year; Snap and Uber, for example, posted combined losses in the billions last year. What do companies like these have in common? They hold vast amounts of data, and in today's digital era, data has become a commodity more valuable than gold.
As the age of AI dawns upon us, one fact has become clear: for generative AI to work efficiently, it requires data, and lots of it. Diverse, representative, and bias-free data is not just a "nice to have" but a necessity for building accurate AI models, whichever methodology is used to train them: supervised, semi-supervised, unsupervised, reinforcement, or transfer learning. Yet the intangibility and novelty of AI make it easy to overlook how LLM (Large Language Model) providers acquire and process that data.
As we've grown more globally conscious, we've seen increasing calls for transparency within supply chains, from lithium mining for electric cars to cotton growing for fast fashion. Many industries that deal in physical products succeed because they can tap cheap labor within their supply chains, often supplied by workers in the Global South.
Within generative AI, we can draw similar supply-chain parallels, not just in the materials that power the hardware but also in how data is sourced and processed.
Monetizing user data is not a novel concept. Web 2.0 platforms like Facebook pioneered this business model to serve highly targeted, profitable ads. While the approach fueled the explosive growth of numerous tech companies, it hasn't been controversy-free, as the Cambridge Analytica scandal showed.
Today, we're entering a phase where data is no longer just analyzed for insights or used to power ads. It's being used to train generative AI models, and, ironically, the companies that have profited most from user-generated content are now fighting with LLM providers over the data you put on their platforms.
Platforms such as Reddit, Twitter, and Stack Overflow have announced plans to charge LLM providers for using their (or your) data. Outside of user-generated content, we've recently seen a spate of copyright lawsuits, with LLM providers accused of training their models on copyrighted works. Google's recent change to its privacy policy, which permits publicly available online data to be used to train its AI systems, underscores how central training data has become to this evolving landscape.
In this intricate landscape, many tech companies, profitable or not, continue operating on the strength of their data repositories, betting that these vast reserves will drive future profitability as demand for diverse AI training data grows.
Navigating the legal aspects of this evolving data landscape is complex. Scraping or collecting public data on the web is not illegal; although many websites include terms that prohibit scraping, those terms are hard to enforce in practice. Copyrighted data is not illegal to collect, but it is illegal to use without the owner's permission. Because generative AI platforms tend not to disclose their training data, however, proving that copyrighted material was used is difficult in most cases (with some exceptions). Under most laws, collecting or using personally identifiable information, such as names and addresses, is illegal without the owner's explicit permission. Yet personal information is often publicly available and easily accessed, so scraped personal data can end up training LLMs without the owners' knowledge. That leaves identifiable information open to data extraction attacks like the one carried out on OpenAI's older LLM, GPT-2.
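To make that risk concrete, here is a minimal, hypothetical probe in the spirit of the published GPT-2 extraction work: sample several continuations from the public GPT-2 checkpoint behind a PII-shaped prompt and look for text that repeats verbatim across samples. The prompt, the name in it, and the sampling parameters are illustrative assumptions, not the attack's actual settings:

```python
# Toy sketch of a training-data extraction probe against public GPT-2.
# Requires: pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# A hypothetical PII-shaped prefix; such prompts tend to elicit
# memorized strings if any were present in the training data.
prompt = "Contact information: John Smith, email:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample several continuations. Memorized sequences often reappear
# verbatim across independent samples, one signal the real attack uses.
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=40,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)

for i, out in enumerate(outputs):
    print(f"--- sample {i} ---")
    print(tokenizer.decode(out, skip_special_tokens=True))
```

The actual attack was far more systematic, generating hundreds of thousands of samples and scoring them against reference models, but the sketch shows how little machinery is needed to start probing a model for memorized data.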
Ultimately, the data economy creates new ethical imperatives and social responsibilities. As stewards of user data, tech leaders play an outsized role in setting the moral compass of AI systems that influence society.
Because analyzing AI ethics tradeoffs is complex, organizations need frameworks for assessing the risks and moral implications of data-driven technologies. Keeping diversity, equity, and inclusion at the heart of AI development is vital to preventing biased AI systems, and one practical starting point is simply auditing what is in your training set, as in the sketch below.
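Here is a minimal, hypothetical representation audit: count training examples per demographic group and flag any group that falls below a chosen share. The records, the "region" field, and the 10% threshold are assumptions for illustration, not a standard from any particular framework:

```python
# Toy representation audit over a labeled dataset.
from collections import Counter

# Hypothetical training records; in practice these would be loaded
# from your actual dataset.
records = [
    {"text": "...", "region": "north_america"},
    {"text": "...", "region": "north_america"},
    {"text": "...", "region": "north_america"},
    {"text": "...", "region": "europe"},
    {"text": "...", "region": "europe"},
    {"text": "...", "region": "south_asia"},
]

MIN_SHARE = 0.10  # illustrative threshold for "under-represented"

counts = Counter(r["region"] for r in records)
total = sum(counts.values())

for group, n in counts.most_common():
    share = n / total
    flag = "  <-- under-represented" if share < MIN_SHARE else ""
    print(f"{group}: {n} examples ({share:.0%}){flag}")
```

A real audit would look at many more dimensions (language, dialect, age, topic coverage) and feed the findings back into data sourcing, but even a count like this surfaces gaps that would otherwise bake bias into the model.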
The data revolution marks a watershed moment in technology's relationship with society. As the trailblazers guiding its trajectory, data scientists and software engineering leaders bear a profound responsibility. By tirelessly championing transparency, ethics, and human welfare, we can illuminate data's immense potential to be a force for good. The choices we make today will profoundly shape our collective digital future.
In a world where "if it's free, you're the product," the implications for us as individuals are substantial. Our data is quickly becoming a commodity for AI training, positioning us as inadvertent data suppliers. The Black Mirror episode "Joan Is Awful" vividly highlights the potential misuse of the data we sign away in lengthy terms and conditions agreements, urging us to contemplate our privacy and security in this evolving data economy.
As we continue to wade through the complexities of the emerging data trade, striking a balance between advancing AI and preserving privacy becomes crucial. Awareness of our digital footprint's value and its implications in the new data economy is more important than ever. So as platforms continue the fight to commoditize your data, will you think twice before posting your thoughts online?