Frustrations with Google Takeout

Problems I experienced with Google Takeout and how I worked around them

When I purchased the mulcahy.ca domain name I thought it would be a good idea to use Google Apps For Your Domain. Basically, I got something that looked like a normal Google account - Gmail, etc. - but my login was [redacted]@mulcahy.ca instead of [redacted]@gmail.com. There were some frustrations over the years, but I was able to live with them. My takeaway from the experience is that it's better to use the mass-market product than a niche product. If you want things to just work, choose Toyota over Ferrari.

Google Apps For Your Domain went through multiple renames, and each rename seemed to come with a price increase. I found some details here. I don't believe I ever used the free product - maybe I was initially paying $50/year. Now the product is called Google Workspace and I'm paying $40 - per month! The cost has increased by nearly 10x. It seems to be aimed at small business owners rather than hobbyists.

I host my blog for free on GitHub Pages and my domain registrar provides email forwarding for $5/year, so I've wanted to leave Google Workspace and save some money for a long time. In theory, it should be easy. Google Takeout advertises the ability to download all of your data. In practice, not so much.

Takeout Problem #1: Incomplete data

The data I downloaded from Takeout was incomplete. Apparently I'm not the only one. I downloaded folders from Google Drive and some files in subfolders were missing. This is a big problem for me because I have important records I can't lose. I'm fortunate that I noticed the problem before deleting the original data.

Takeout Problem #2: Missing metadata

When I used Takeout to download photos from Google Photos, the timestamps were missing. This was a big problem for me because my photos of my children as babies were in my Google Workspace account, and I really wanted to know when each photo was taken.

Takeout Problem #3: File conversion

I had lots of gdocs, and Takeout automatically converted them to docx. Some of the file/folder names also got changed slightly. This isn't as much of a deal-breaker as the first two problems, but it still makes me unhappy. I wanted to re-upload the Takeout files into my personal Gmail account, so they should have been able to remain gdocs. This whole debacle makes me question whether I should be using gdocs at all, but I'll defer that decision to later.

Google Photos Solution

I used "Partner Sharing" to share all of my photos with my personal Gmail account, and then in that personal account I chose to save all photos to my account. This worked quite well and really wasn't too hard. The only thing I didn't like is that it didn't tell me when sharing was complete, so I don't really know when it's safe to delete the originals, but probably it's okay a day later? Make sure you choose the option to save the shared photos to your account.

Google Drive Solution

I couldn't find an easy solution for Google Drive. There is a setting to transfer ownership, but it doesn't allow transfers outside of your "organization". You can share folders with people outside of your organization, but you can't make them the owner.

I also tried syncing the data with Google Drive for Desktop and then copying it manually. This lost the Drive-native files, like gdocs: they are synced as pointers to the originals, which are still owned by my Workspace account.
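
For illustration, a synced gdoc isn't the document itself - it's a tiny JSON pointer file. Here's a sketch of reading one in Python; the exact fields vary by Drive for Desktop version, so treat the structure as an approximation:

import json

# A .gdoc synced by Drive for Desktop is a small JSON pointer, not the document.
# Field names vary by client version; "url" is the important one.
with open('Some Document.gdoc') as f:
    pointer = json.load(f)

print(pointer['url'])  # e.g. https://docs.google.com/document/d/<file id>/edit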

I moved all of my content into a folder named to-share and shared that with my personal account. Then I had ChatGPT write me some Python to copy that folder recursively. I had to generate a credentials.json file, which was annoying and which I'd rather not have done with my personal account. But I'll delete the credentials once the operation is complete.

ChatGPT's code kind of worked, but I had to tweak it to preserve file names and metadata. Then I noticed it was only copying the first 100 files in large folders, so I had to change it to handle pagination. It's worth double-checking the results.

I find it absurd how difficult this is, but I doubt there was any ill intent. Google probably decided not to allow ownership transfers outside of the organization because it would be catastrophic if you transferred ownership accidentally. Anyway, here's the code in case it helps someone else.

import os
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

# If you change SCOPES later, delete token.json so that a new token is generated
SCOPES = ['https://www.googleapis.com/auth/drive']

# Authenticate and create the service
def authenticate():
    """Authenticate and return the service."""
    creds = None
    # The file token.json stores the user's access and refresh tokens, and is created automatically when the authorization flow completes for the first time.
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=8080)
        # Save the credentials for the next run
        with open('token.json', 'w') as token:
            token.write(creds.to_json())

    service = build('drive', 'v3', credentials=creds)
    return service

def get_folder_contents(service, folder_id):
    """Get all files and subfolders in a folder, handling pagination."""
    items = []
    page_token = None

    while True:
        # List the files in the folder, handling pagination with page_token
        # (and skipping trashed files, which still have this folder as a parent)
        results = service.files().list(
            q=f"'{folder_id}' in parents and trashed = false",
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token
        ).execute()

        # Add the files from this page to the list of items
        items.extend(results.get('files', []))

        # Check if there is another page of results
        page_token = results.get('nextPageToken')
        if not page_token:
            break  # No more pages, exit the loop

    print(f"returning {len(items)} items")

    return items

def copy_file(service, file_id, folder_id):
    """Copy a file to the new folder, preserving the original file name."""
    # Get the file's metadata to preserve its original name
    file = service.files().get(fileId=file_id, fields='name').execute()
    file_name = file['name']

    # Prepare the metadata for the copy operation
    file_metadata = {'name': file_name, 'parents': [folder_id]}

    # Copy the file to the new folder
    copied_file = service.files().copy(fileId=file_id, body=file_metadata, fields='id, name, mimeType').execute()
    print(f"Copied file: {copied_file['name']} - original_id={file_id}, new_id={copied_file['id']}, mimeType={copied_file['mimeType']}")

    return copied_file['id']

# After copying the file, restore timestamps
def restore_timestamps(service, copied_file_id, original_file_id):
    original_file = service.files().get(fileId=original_file_id, fields='createdTime, modifiedTime').execute()
    created_time = original_file['createdTime']
    modified_time = original_file['modifiedTime']

    # Update the copied file's timestamps (Google Drive doesn't allow setting createdTime directly, but we can update modifiedTime)
    updated_file_metadata = {'modifiedTime': modified_time}
    service.files().update(fileId=copied_file_id, body=updated_file_metadata).execute()

    print(f"Restored timestamps for copied file {copied_file_id}")

# Copy a folder
def copy_folder(service, source_folder_id, destination_folder_id):
    """Recursively copy a folder and its contents."""
    # List everything in the source folder, then copy subfolders and files
    items = get_folder_contents(service, source_folder_id)

    for item in items:
        if item['mimeType'] == 'application/vnd.google-apps.folder':  # If it's a folder
            # Create the folder in the destination
            folder_metadata = {'name': item['name'], 'mimeType': 'application/vnd.google-apps.folder', 'parents': [destination_folder_id]}
            new_folder = service.files().create(body=folder_metadata, fields='id, name').execute()
            print(f"Created folder: {new_folder['name']}")
            restore_timestamps(service, copied_file_id=new_folder['id'], original_file_id=item['id'])
            # Recursively copy the contents of this folder
            copy_folder(service, item['id'], new_folder['id'])
        else:
            # Copy the file to the destination folder
            copied_file_id = copy_file(service, item['id'], destination_folder_id)
            restore_timestamps(service, copied_file_id=copied_file_id, original_file_id=item['id'])

# Main function to copy folder
def copy_drive_folder(source_folder_id, destination_folder_id):
    service = authenticate()
    copy_folder(service, source_folder_id, destination_folder_id)

if __name__ == '__main__':
    source_folder_id = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'  # Replace with your source folder ID
    destination_folder_id = 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY'  # Replace with your destination folder ID
    copy_drive_folder(source_folder_id, destination_folder_id)

You'll also need to pip install dependencies:

pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
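
To double-check the results, you can count the files on both sides and compare. This helper isn't part of the original script - it's a sketch that reuses authenticate() and get_folder_contents() from the code above:

def count_files(service, folder_id):
    """Recursively count the non-folder files under folder_id."""
    total = 0
    for item in get_folder_contents(service, folder_id):
        if item['mimeType'] == 'application/vnd.google-apps.folder':
            total += count_files(service, item['id'])
        else:
            total += 1
    return total

service = authenticate()
print(f"source: {count_files(service, source_folder_id)} files")
print(f"destination: {count_files(service, destination_folder_id)} files")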

Gmail Solution

I didn't really have much of a problem here because I've been forwarding my email from my Google Workspace account to my personal account for years. I've also set up OfflineIMAP so I can keep local copies of my emails.
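
For anyone setting this up, OfflineIMAP only needs a short config file. Here's a minimal sketch of an ~/.offlineimaprc - the account name and paths are placeholders, and you'll need a Google app password since regular passwords are rejected for IMAP:

[general]
accounts = Personal

[Account Personal]
localrepository = LocalMail
remoterepository = GmailRemote

[Repository LocalMail]
type = Maildir
localfolders = ~/mail

[Repository GmailRemote]
type = Gmail
remoteuser = example@gmail.com
# Store an app password in this file and chmod 600 it
remotepassfile = ~/.gmail-app-password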

Solutions for other Google services

I don't think I have much data in any other Google services, but I used Google Takeout to download everything else just in case.


If you enjoyed this post, please let me know on Twitter or Bluesky.

Posted December 27, 2024.

Tags: #python