Digging in on personal AI portability
Six months ago in this medium, I offered an optimistic high-level arc for the present and future of modern-day generative LLM technologies, from new, to widespread, to personal, to portable, and finally to empowered. The pace of change in AI conversations today is staggering. Yet that vision remains intact. In this piece, I’ll dig deeper into what data portability for generative technologies could and should entail, and how portability in this context can help lead to individual empowerment.
The ability of LLMs to be personalized to individual users is becoming clearer. ChatGPT took a large step forward with its announcement of “memory” functions: it can store memories, let users view and disable them, and draw on them to improve the experience.
Compare ChatGPT’s memories with a web browser’s history and cookies. Web browsers store and use this information to help users remember and reconnect with what they have seen and done online. Browser users can see the stored information and delete it in a fine-grained manner - and, at least temporarily, can tell the browser to operate in a “private” mode where it does not store such data. Critically, when you want to use a new web browser, you can import your history and your cookies from another browser.
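That import works because browser data sits in open, inspectable formats. As a minimal sketch, assuming a Chromium-style profile (where history lives in a SQLite file containing a `urls` table; the path below is illustrative and varies by browser, OS, and profile), an export into a neutral structure might look like this:

```python
import json
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Illustrative location for Chromium on Linux; actual paths vary.
HISTORY_DB = Path.home() / ".config/google-chrome/Default/History"

def export_history(db_path: Path) -> list[dict]:
    """Read a Chromium-style History file (a SQLite database) and
    return visited pages in a neutral, portable structure."""
    # Copy first: the browser holds a lock on the live database.
    with tempfile.NamedTemporaryFile(suffix=".sqlite") as tmp:
        shutil.copy(db_path, tmp.name)
        conn = sqlite3.connect(tmp.name)
        try:
            rows = conn.execute(
                "SELECT url, title, visit_count FROM urls"
            ).fetchall()
        finally:
            conn.close()
    return [
        {"url": url, "title": title, "visits": visits}
        for url, title, visits in rows
    ]

if __name__ == "__main__":
    print(json.dumps(export_history(HISTORY_DB)[:5], indent=2))
```

The point is not the code but the property it relies on: the data is stored where the user - or a competing browser - can read it.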
Will the personalizations users provide to generative AI systems be as firmly under user control and as portable, or will lock-in effects develop? Regulators worldwide, already concerned with several dimensions of generative AI systems (alongside their enthusiasm, of course), are unlikely to take lock-in well – particularly European officials, if the result is a perceived increase in the market power of American or Chinese companies.
How far can generative AI portability go? If the goal of portability – at least in the context of competition policy – is to allow the transfer of key elements of a user’s experience from one service to another, that may be impossible. Differences in model weights and training sets, and the tailoring of specific services to different contexts and use cases, all factor heavily into the experience – likely far more so than the downstream customization powered by individual user data and experience. But such non-personalized training is (immensely) business sensitive, and beyond the typical contours of data portability.
DTI’s portability principles offer guidance on how such portability might be scoped to be effective: “Portability policy should focus on user-created content and should not extend to data that negatively impacts the privacy of others or that is used to improve a service (e.g. “inferred data”).” What data could fall within this set? Here are three categories to consider:
- Input data: Data entered directly by the user as part of using the system. For most generative LLM systems this takes the form of questions or “prompts.” Such data would seem to be at the core of data portability, and in fact is typically made available for individual export in modern LLM systems.
- Output data: The direct outputs of generative LLM systems, as provided to users in dialogue, combine with user prompts to build an understanding of user expectations and activity. While these outputs are directly created by the LLM system, the user’s interaction arguably constitutes co-creation. Furthermore, no other individual user’s interaction is at risk of being shared. And while this question merits more research, reverse-engineering proprietary information such as model weights from even a large collection of input/output pairs would seem difficult, given the challenges of explainability.
- Observed data: Some modern-day generative AI systems observe user activity in the digital or physical world, through cameras, audio recording devices, or simple screen captures; Microsoft’s Copilot+ PC is a powerful example. Such observed data is, arguably, a form or special case of input data, as it captures the user’s actions. It also presents significant potential for lock-in – if only one system has access to a user’s digital memories, the cost of switching to another becomes quite high.
A notable exclusion from these three categories is derived data: data produced by processing input data through internal mechanisms, retained for future internal use rather than shared back to the user. This parallels past portability distinctions, and seems important to preserve here.
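To make this scoping concrete, here is a minimal sketch of what a shared export schema covering these three categories might look like. Every type and field name below is a hypothetical illustration, not any vendor’s actual format:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class InputRecord:
    """A prompt the user typed - the core of portable data."""
    record_id: str
    timestamp: datetime
    prompt: str

@dataclass
class OutputRecord:
    """The system's reply; arguably co-created with the prompt."""
    timestamp: datetime
    in_reply_to: str  # record_id of the InputRecord it answers
    text: str

@dataclass
class ObservedRecord:
    """A capture of user activity - a special case of input data."""
    timestamp: datetime
    source: str       # e.g. "screen", "microphone", "camera"
    raw_payload: bytes

@dataclass
class PortableArchive:
    """One user's exportable data, grouped by the categories above."""
    inputs: list[InputRecord] = field(default_factory=list)
    outputs: list[OutputRecord] = field(default_factory=list)
    observed: list[ObservedRecord] = field(default_factory=list)
    # Deliberately no field for derived data: internal inferences
    # remain with the service, per the scoping discussed above.
```

The deliberate absence of a derived-data field is the design choice that matters: internal inferences stay with the service, consistent with the principles quoted above.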
Observed data makes the prospect of portability quite complicated, however – particularly given the difficulties of distinguishing and excluding data based on its level of processing. Assuredly, not all observed data will be retained; the storage requirements would be significant, and requiring all raw data to be kept solely for portability could make systems unworkable. But could the selection of what observed data to store itself reveal something proprietary? How can such a risk be assessed and mitigated?
Perhaps most concerning: Can a requirement to provide only observed and unprocessed data be easily circumvented by processing incoming data with proprietary means and then discarding the raw data, thus rendering it nonportable?
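A toy illustration of that loophole, assuming nothing about any real system’s pipeline (the hash below merely stands in for a proprietary, irreversible transformation):

```python
import hashlib

def proprietary_embed(raw: bytes) -> bytes:
    # Stand-in for a proprietary, irreversible transformation;
    # a real system might produce embeddings or summaries instead.
    return hashlib.sha256(raw).digest()

def ingest(raw_capture: bytes, derived_store: list[bytes]) -> None:
    """Persist only the derived form; the raw capture is never stored."""
    derived_store.append(proprietary_embed(raw_capture))
    # If a portability obligation reaches only raw observed data,
    # nothing this service retains would ever be exportable.
```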
There aren’t obvious or easy answers to these questions, and it will take significant time and effort to work through them. But as portability obligations and generative AI personalization collide, making progress will matter significantly. Working towards a shared data model and portability approach could provide a useful forum in which to both develop and encode answers.
The future for autonomy in AI remains far from clear. Cornelia Kutterer contends, in a piece recently published by DTI, that we will see three layers of agents emerging: platforms, applications, and services of various forms. Her analysis of the values and legal frameworks applicable to this question is a useful baseline. Complementing it will be more work in the direction of this piece - for example, what, exactly, should be ported to transfer a user’s experience meaningfully, without compromising security or secrets?
DTI Updates:
- During our all-day Data Transfer Summit in Washington DC earlier this year, seven scholars presented original research on data portability; we have now published on our website the final versions of these papers in a new volume: “The Present and Future of Data Portability.”
- At this year’s CPDP conference in late May, Delara hosted a workshop on data portability, AI, and the DMA, together with European academic Tommaso Crepax. It was well attended and well received.
- We submitted a funding proposal to the Digital Infrastructure Insights Fund. The future of our trust efforts will take a few forms, and one of those is technical, so we are seeking funding to support an initial implementation of a neutral third-party trust registry for the specific context of data portability.