How the Web Broke Data Portability

Once upon a time, the Internet made moving your email or calendar fairly easy. Then a shift in emphasis from storing data on local devices to storing data in “the cloud”, i.e. on remote servers, helped enable many new kinds of online activities and communities, with the side effect of making data portability more difficult. That is why DTI exists as an organization: to help design and deliver data transfer solutions for the modern, complex data landscape. To help explain that context, I want to write a little about this shift and how we got to where we are today.

I love the Web. When I first started professional work in 1996, I was one of the few people in my company who knew how to build and host a Web site. I showed software industry folks (more senior than me, but who hadn’t just spent four years at university using the Web) how to find a site by using Yahoo!’s manually curated Web Directory or the brand-new, ground-breaking search engine AltaVista. I became an evangelist for Web standards, explaining that lots more people could put content online and make that content much more engaging than FTP, Gopher servers and bulletin board systems could. But along with Web standards, I also worked on IMAP and SMTP (mail standards), IRC (a chat standard) and other classic pre-Web Internet protocols, so I have been exposed to both protocol styles for thirty years.

Here’s how application servers on the Internet tended to work before the Web:

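[Diagram: each client talks to its own server over a client-server protocol such as IMAP, and the servers exchange messages over a server-to-server protocol such as SMTP]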

In order for me to use a server for a function like email, I needed to install a client that could speak a client-server protocol such as IMAP with an email server. But if my friend had a different email server, we needed a server-to-server protocol such as SMTP to propagate messages. [Technically, SMTP is not only a server-to-server protocol, since mail clients also use it to submit outgoing mail, so this is a simplified picture.] Besides email, functions like chat, calendaring and file sharing were being built out in a similar way. This was slow to build, but by the time it was widespread, there was choice not only among email service providers, but also among email clients.

A side-effect of this situation was that I could always get a complete copy of my mailbox - I could install an IMAP client application that could fetch all the email, understand the content, and save local copies. Many people moved email servers by telling their mail client applications to make sure to fetch all the email and store it locally, then they would sever the connection to the old email server, connect to a new one, and get the mail client application to synchronize all the email with the new server. Data portability was easy when client-server protocols were available, and helped provide real opportunities to change service providers. I changed email providers several times between 1991 and 2006 and brought my email archives with me each time, because I was able to choose a mail client application that was able to help me move my data.
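As a rough illustration of that kind of move, here is a minimal sketch using Python's standard imaplib module. The copy_mailbox and migrate helpers, the host names and the credentials are all assumptions made up for this example, and the sketch ignores message flags, folder hierarchies and error recovery that a real mail client would handle.

```python
import imaplib


def copy_mailbox(src, dst, mailbox="INBOX"):
    """Copy every message in `mailbox` from `src` to `dst`; return the count.

    `src` and `dst` are already-authenticated imaplib connections
    (or objects with the same select/search/fetch/append methods).
    """
    typ, _ = src.select(mailbox, readonly=True)  # read-only: never alter the old server
    if typ != "OK":
        raise RuntimeError(f"cannot open {mailbox!r} on the source server")

    typ, data = src.search(None, "ALL")
    if typ != "OK":
        raise RuntimeError("search failed on the source server")

    copied = 0
    for num in data[0].split():
        typ, parts = src.fetch(num, "(RFC822)")
        if typ != "OK":
            continue
        # imaplib returns a list mixing (envelope, body) tuples and bare bytes;
        # the full message is the second item of the tuple.
        raw = next(p[1] for p in parts if isinstance(p, tuple))
        # Flags and the original internal date are dropped in this sketch.
        dst.append(mailbox, None, None, raw)
        copied += 1
    return copied


def migrate(old_host, new_host, user, old_password, new_password):
    """Hypothetical end-to-end move; hosts and credentials are placeholders."""
    src = imaplib.IMAP4_SSL(old_host)
    dst = imaplib.IMAP4_SSL(new_host)
    try:
        src.login(user, old_password)
        dst.login(user, new_password)
        return copy_mailbox(src, dst)
    finally:
        src.logout()
        dst.logout()
```

The point of the sketch is that nothing in it is specific to one provider: because both servers speak the same protocol, the same few calls work against either of them.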

Web 2.0, which took off around 2005, began to change this by allowing applications to be built only for Web browsers. Application data was still downloaded to the user’s computer using HTTP, but it was now embedded in Web pages as HTML. The browser didn’t have to understand the information, only the display instructions.

This was a great improvement in many ways. It saved users from having to install an email client, a chat client, a calendar client, a contacts app, a photos client – users didn’t like to install and maintain that many different software applications. Companies who moved to Web interfaces stopped having to interoperate with all the old versions of all these client applications. They no longer had to wait five years for a protocol standard to have a single important new feature - they could just deliver the feature by modifying the view of information they presented on their Web pages.

However, a shared understanding of the information was lost. Here’s a snippet of how IMAP transfers message information, in which the same message would look the same from any IMAP server:

To: <me@example.com>
Message-Id: <cb6b04b3-b27b-41bb-b1b8@example.com>
Date: Thu, 3 Apr 2025 15:57:12 +1000
From: "Stacy Fakename" <stacy@other.example.org>
Subject: A few important questions

An IMAP client “understands” who the message was sent to, who it was sent from, and what its full subject is. Then it decides how to display those pieces of information, or how to search, filter or otherwise use them.
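To make that “understanding” concrete, here is a small sketch that parses the same five header lines with Python's standard email package. The summarize helper and its field names are assumptions for this example; any mail client would do something similar in its own language.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# The header lines from the snippet above, followed by the blank line
# that separates headers from the (empty) body.
RAW_HEADERS = """\
To: <me@example.com>
Message-Id: <cb6b04b3-b27b-41bb-b1b8@example.com>
Date: Thu, 3 Apr 2025 15:57:12 +1000
From: "Stacy Fakename" <stacy@other.example.org>
Subject: A few important questions

"""


def summarize(raw):
    """Return the semantic fields of a message as a plain dict."""
    msg = message_from_string(raw)
    sender_name, sender_addr = parseaddr(msg["From"])
    _, recipient_addr = parseaddr(msg["To"])
    return {
        "to": recipient_addr,
        "from_name": sender_name,
        "from_addr": sender_addr,
        "subject": msg["Subject"],
        "message_id": msg["Message-Id"],
        # A timezone-aware datetime, so the client can display it however it likes.
        "date": parsedate_to_datetime(msg["Date"]),
    }


if __name__ == "__main__":
    for field, value in summarize(RAW_HEADERS).items():
        print(f"{field}: {value}")
```

Because the field names and formats are standardized, this code does not need to know which server or provider the message came from.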

How does a Web page show a message’s To, From, Subject and Message ID? It’s different for every service and every view. Part of it might look like this by the time it’s shown at the browser:

<div class="msg_checkbox msg_row__checkbox resizable_table__cell">
<div class="ticky_pair"><input class="ticky simple_msg msg_row__checkbox__input" 
id="msg_checkbox_257380241" onclick="R.messages.checkboxClicked();" type="checkbox">
<label for="msg_checkbox_257380241"></label></div></div>
<div class="resizable_table__cell msg_row__date rsp_only">April 3, 2025 4:31 PM</div>
<div class="resizable_table__cell msg_row___avatar">
<img class="msg_row___avatar__image rsp_only"
data-delayed-image-src="https://cache.example.com/Stacy/83502012/blue-hair_medium.jpg"
src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">
<div class="msg_row__username">Stacy</div></div>
<div class="msg_row__subject msg_row__subject--read resizable_table__cell">
<a href="/people/lisa/messages?open=257380241" 
onclick="return R.messages.open(257380241, true);">A few important questions</a></div>
<div class="msg_row__content msg_row__content--read resizable_table__cell rsp_only" 
id="msg_row__content__257380241"></div>
<div class="resizable_table__cell date rsp_hidden">Apr 3, 2025</div>
<div class="resizable_table__cell time rsp_hidden">3:57 PM</div>
<div class="resizable_table__cell msg_row__reply_status reply_status" 
id="reply_status_257380241"></div>

Almost all the data in that HTML block is display information, not message information. The HTML block actually contains less information than the IMAP snippet, which was only five lines: the recipient, the sender’s address and the Message-Id are missing entirely. The subject and date are in the HTML, but they’re not tagged in a way that lets a client identify them semantically. The data might also be transformed, e.g. a long Subject might be truncated and the client couldn’t reliably tell. Pulling the information out of this is hard, and any code written to do that might break any day if the email provider changes details of their view.
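To show how fragile that kind of extraction is, here is a sketch of a scraper for a trimmed copy of the row above, written with Python's standard html.parser module. The RowScraper class, the scrape helper and the mapping from CSS class names to fields are guesses made for this example; nothing in the page says which class holds which piece of data.

```python
from html.parser import HTMLParser

# A trimmed copy of the message row shown above.
ROW_HTML = """
<div class="resizable_table__cell msg_row___avatar">
<div class="msg_row__username">Stacy</div></div>
<div class="msg_row__subject msg_row__subject--read resizable_table__cell">
<a href="/people/lisa/messages?open=257380241">A few important questions</a></div>
<div class="resizable_table__cell date rsp_hidden">Apr 3, 2025</div>
<div class="resizable_table__cell time rsp_hidden">3:57 PM</div>
"""


class RowScraper(HTMLParser):
    """Collect text from <div>s whose class list contains a known name."""

    # Guessed mapping from this one provider's CSS classes to message fields.
    WANTED = {
        "msg_row__username": "from",
        "msg_row__subject": "subject",
        "date": "date",
        "time": "time",
    }

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # one entry per open <div>: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((self.WANTED[c] for c in classes if c in self.WANTED), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost enclosing <div> that maps to a field.
        field = next((f for f in reversed(self._stack) if f), None)
        if field and data.strip():
            self.fields[field] = self.fields.get(field, "") + data.strip()


def scrape(html):
    parser = RowScraper()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    print(scrape(ROW_HTML))
    # Simulate the provider renaming one CSS class in a redesign.
    print(scrape(ROW_HTML.replace("msg_row__subject", "message-subject")))
```

Even when it works, the scraper recovers only a display name for the sender, not an address, and a single renamed class silently drops the subject.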

Here’s how email works now:

[Diagram]

But the Web also often removes the need for some new server-to-server standards. Looking at photo sharing today, we can see how the Web works without application-specific standards even if we have multiple photo services. My friend and I have our photos on different services entirely, but we each use our Web browsers to “go to” the other person’s service to see their photos:

[Diagram: my friend and I each use a Web browser to visit the other person’s photo service]

Because neither service needs to implement photo sharing standards to speak with each other or with client applications, both services can innovate. They upgrade their features any time they have a great new idea, and we can each of us select the photo service that suits us best.

However, neither of our browsers gets usable semantic information about the organization of our photos, or about the tags on them and responses to them, this way! We both get the views that the services decide to offer to us. If I don’t like the view of my friend’s photos, that’s just too bad. We can’t even reasonably “hack” information out of our photo Web pages because our services can change the structure of their views at any moment.

If we are going to get data portability now, it’s up to the services to provide it, because they’re the only ones who have access to all the data. We’re also at the mercy of services to provide views that work for us, to provide data exports that really work, and to provide interoperability with partners if they choose. I can’t even opt out of using my friend’s photo service if I want to see her photos.

We’ve gotten used to adapting ourselves to the Web pages that we get, but it’s worth noting all the things that we can’t do this way:

Folks who need accessibility aids can’t use their own software to view or manage data in Web pages. This affects folks with low vision or colour blindness, but also folks who need keyboard shortcuts and can’t easily target a small mouse pointer inside a small checkbox. Web browsers have some ability to help, but it’s limited. We could do better on the Web, yes, but for mature services, dedicated applications could help a lot.
We can’t use our own software to support a particular use case or feature. A professional photographer may want different features from the person who hosts their family photo archive.
We can’t easily use software that filters out comments we don’t like, or filters out comments entirely.
We can’t choose a recommendation algorithm other than the one optimized by the service provider for their general purposes.
We can’t synchronize our data between two services (e.g. one to edit photos, one to share them) unless the service decides to do this for us.

We did see this coming. We predicted this in the Internet Standards community ten to fifteen years ago, with much discussion about security tradeoffs as well as innovation speed. Much of the standards work in the last fifteen years has been about addressing security issues while preserving innovation speed by making Web services even better, in both W3C and IETF.

So what now? I personally like to remind myself that despite what we’ve lost in semantic access to our data, we’ve also gained enormously in the last 20 years. Many of my friendships grew in online communities that could not have gained as much traction with only IMAP and FTP. Like many folks I enjoy shopping, music streaming, and watching videos online without having to install or upgrade specialized software. Maybe, though, it’s possible to have both Web views and semantic views. Why not both?
