Internetarchive: A Python Interface to archive.org

Internetarchive Library

Internetarchive is a Python interface to archive.org.

Usage:

>>> from internetarchive import get_item
>>> item = get_item('govlawgacode20071')
>>> item.exists
True

copyright: (C) 2012-2017 by Internet Archive.
license: AGPL 3, see LICENSE for more details.

internetarchive.Item

class Item(archive_session, identifier, item_metadata=None)

Bases: internetarchive.item.BaseItem

This class represents an archive.org item. Generally this class should not be used directly, but rather via the internetarchive.get_item() function:

>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)

Or to modify the metadata for an item:

>>> metadata = dict(title='The Stairs')
>>> item.modify_metadata(metadata)
>>> print(item.metadata['title'])
'The Stairs'

This class also uses IA’s S3-like interface to upload files to an item. You need to supply your IAS3 credentials in environment variables in order to upload:

>>> item.upload('myfile.tar', access_key='Y6oUrAcCEs4sK8ey',
...                           secret_key='youRSECRETKEYzZzZ')
True

You can retrieve S3 keys here: https://archive.org/account/s3.php

download(files=None, formats=None, glob_pattern=None, dry_run=None, verbose=None, silent=None, ignore_existing=None, checksum=None, destdir=None, no_directory=None, retries=None, item_index=None, ignore_errors=None, on_the_fly=None, return_responses=None, no_change_timestamp=None)

Download files from an item.

Parameters:
  • files – (optional) Only download files matching given file names.
  • formats (str) – (optional) Only download files matching the given Formats.
  • glob_pattern (str) – (optional) Only download files matching the given glob pattern.
  • dry_run (bool) – (optional) Output download URLs to stdout, don’t download anything.
  • verbose (bool) – (optional) Turn on verbose output.
  • silent (bool) – (optional) Suppress all output.
  • ignore_existing (bool) – (optional) Skip files that already exist locally.
  • checksum (bool) – (optional) Skip downloading file based on checksum.
  • destdir (str) – (optional) The directory to download files to.
  • no_directory (bool) – (optional) Download files to current working directory rather than creating an item directory.
  • retries (int) – (optional) The number of times to retry on failed requests.
  • item_index (int) – (optional) The index of the item for displaying progress in bulk downloads.
  • ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.
  • on_the_fly (bool) – (optional) Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
  • return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.
  • no_change_timestamp (bool) – (optional) If True, leave the time stamp as the current time instead of changing it to that given in the original archive.
Return type:

bool

Returns:

True if all files have been downloaded successfully.
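
As a rough illustration of how several of the options above combine in a single call, the helper below wraps download() with a glob filter, a destination directory, skip-if-present behaviour and a retry count. The helper name download_matching and its default values are hypothetical; it only assumes an object exposing the download() signature documented above, such as one returned by get_item().

```python
def download_matching(item, pattern, destdir="downloads", retries=3):
    """Download files whose names match `pattern` into `destdir`.

    `item` is assumed to expose the download() method documented above.
    """
    return item.download(
        glob_pattern=pattern,   # only files matching the glob pattern
        destdir=destdir,        # directory to download files to
        ignore_existing=True,   # skip files that already exist locally
        retries=retries,        # number of retries on failed requests
    )

# Usage (requires network access and the internetarchive package):
#   from internetarchive import get_item
#   download_matching(get_item('nasa'), '*.xml')
```

Passing dry_run=True instead would, per the parameter list above, print the download URLs without fetching anything, which can be a convenient way to check a pattern first.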

get_file(file_name, file_metadata=None)

Get a File object for the named file.

Parameters:
  • file_metadata (dict) – (optional) A dict of metadata for the given file.
Return type: internetarchive.File
Returns: An internetarchive.File object.

modify_metadata(metadata, target=None, append=None, append_list=None, priority=None, access_key=None, secret_key=None, debug=None, request_kwargs=None)

Modify the metadata of an existing item on Archive.org.

Note: The Metadata Write API does not yet comply with the latest JSON Patch standard. It currently complies with version 02.

Parameters:
  • metadata (dict) – Metadata used to update the item.
  • target (str) – (optional) Set the metadata target to update.
  • priority (int) – (optional) Set task priority.
  • append (bool) – (optional) Append value to an existing multi-value metadata field.
  • append_list (bool) – (optional) Append values to an existing multi-value metadata field. No duplicate values will be added.

Usage:

>>> import internetarchive
>>> item = internetarchive.Item('mapi_test_item1')
>>> md = dict(new_key='new_value', foo=['bar', 'bar2'])
>>> item.modify_metadata(md)

Return type: dict
Returns: A dictionary containing the status_code and response returned from the Metadata API.

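
To make the append_list description concrete, the sketch below models locally what "no duplicate values will be added" implies for a multi-value field. It is only an illustration of the described outcome, not the Metadata API's implementation, and the function name append_unique is hypothetical.

```python
def append_unique(existing, new_values):
    """Return the field's values after appending `new_values` without duplicates.

    `existing` may be None (field absent), a single string, or a list,
    mirroring the shapes a metadata field can take.
    """
    if existing is None:
        current = []
    elif isinstance(existing, str):
        current = [existing]
    else:
        current = list(existing)
    for value in new_values:
        if value not in current:   # skip values that are already present
            current.append(value)
    return current

# A field holding 'bar' that receives ['bar', 'bar2'] ends up as ['bar', 'bar2'].
print(append_unique('bar', ['bar', 'bar2']))
```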
upload(files, metadata=None, headers=None, access_key=None, secret_key=None, queue_derive=None, verbose=None, verify=None, checksum=None, delete=None, retries=None, retries_sleep=None, debug=None, request_kwargs=None)

Upload files to an item. The item will be created if it does not exist.

Parameters:
  • files (list) – The filepaths or file-like objects to upload.
  • kwargs (dict) – The keyword arguments from the call to upload_file().

Usage:

>>> import internetarchive
>>> item = internetarchive.Item('identifier')
>>> md = dict(mediatype='image', creator='Jake Johnson')
>>> item.upload('/path/to/image.jpg', metadata=md, queue_derive=False)
True

Return type: list
Returns: A list of requests.Response objects.

upload_file(body, key=None, metadata=None, headers=None, access_key=None, secret_key=None, queue_derive=None, verbose=None, verify=None, checksum=None, delete=None, retries=None, retries_sleep=None, debug=None, request_kwargs=None)

Upload a single file to an item. The item will be created if it does not exist.

Parameters:
  • body (filepath or file-like object) – File or data to be uploaded.
  • key (str) – (optional) Remote filename.
  • metadata (dict) – (optional) Metadata used to create a new item.
  • headers (dict) – (optional) Add additional IA-S3 headers to request.
  • queue_derive (bool) – (optional) Set to False to prevent an item from being derived after upload.
  • verify (bool) – (optional) Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
  • checksum (bool) – (optional) Skip based on checksum.
  • delete (bool) – (optional) Delete local file after the upload has been successfully verified.
  • retries (int) – (optional) Number of times to retry the given request if S3 returns a 503 SlowDown error.
  • retries_sleep (int) – (optional) Amount of time to sleep between retries.
  • verbose (bool) – (optional) Print progress to stdout.
  • debug (bool) – (optional) Set to True to print headers to stdout, and exit without sending the upload request.

Usage:

>>> import internetarchive
>>> item = internetarchive.Item('identifier')
>>> item.upload_file('/path/to/image.jpg',
...                  key='photos/image1.jpg')
True
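
The verify and checksum options both rely on MD5 checksums. As a minimal sketch of the local half of such a comparison, the function below computes a file's MD5 digest in chunks using only the standard library. The name local_md5 is hypothetical, and how the library itself performs the comparison is not shown here.

```python
import hashlib

def local_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string is what would be compared against the checksum reported for the remote copy of the file.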

internetarchive.File

class File(item, name, file_metadata=None)

Bases: internetarchive.files.BaseFile

This class represents a file in an archive.org item. You can use this class to access the file metadata:

>>> import internetarchive
>>> item = internetarchive.Item('stairs')
>>> file = internetarchive.File(item, 'stairs.avi')
>>> print(file.format, file.size)
Cinepack 3786730

Or to download a file:

>>> file.download()
>>> file.download('fabulous_movie_of_stairs.avi')

This class also uses IA’s S3-like interface to delete a file from an item. You need to supply your IAS3 credentials in environment variables in order to delete:

>>> file.delete(access_key='Y6oUrAcCEs4sK8ey',
...             secret_key='youRSECRETKEYzZzZ')

You can retrieve S3 keys here: https://archive.org/account/s3.php
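
Rather than hard-coding keys as in the example above, they can be read from the environment at call time. The sketch below assumes two environment variable names, MY_IA_ACCESS_KEY and MY_IA_SECRET_KEY, which are purely hypothetical placeholders and not names the library is known to read on its own; the helper name delete_with_env_keys is likewise illustrative.

```python
import os

def delete_with_env_keys(file_obj, environ=os.environ):
    """Call file_obj.delete() with IA-S3 keys taken from environment variables.

    The variable names below are hypothetical; substitute whatever names
    your deployment uses. A KeyError is raised if either one is missing.
    """
    access_key = environ["MY_IA_ACCESS_KEY"]   # hypothetical variable name
    secret_key = environ["MY_IA_SECRET_KEY"]   # hypothetical variable name
    return file_obj.delete(access_key=access_key, secret_key=secret_key)
```

Keeping the lookup in one place avoids scattering secrets through scripts and makes a missing variable fail loudly before any request is sent.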

delete(cascade_delete=None, access_key=None, secret_key=None, verbose=None, debug=None, retries=None, headers=None)

Delete a file from the Archive. Note: Some files – such as <itemname>_meta.xml – cannot be deleted.

Parameters:
  • cascade_delete (bool) – (optional) Also deletes files derived from the file, and files the file was derived from.
  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.
  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
  • verbose (bool) – (optional) Print actions to stdout.
  • debug (bool) – (optional) Set to True to print headers to stdout and exit without sending the delete request.
download(file_path=None, verbose=None, silent=None, ignore_existing=None, checksum=None, destdir=None, retries=None, ignore_errors=None, fileobj=None, return_responses=None, no_change_timestamp=None)

Download the file into the current working directory.

Parameters:
  • file_path (str) – Download file to the given file_path.
  • verbose (bool) – (optional) Turn on verbose output.
  • silent (bool) – (optional) Suppress all output.
  • ignore_existing (bool) – (optional) Skip the download if the file already exists locally.
  • checksum (bool) – (optional) Skip downloading file based on checksum.
  • destdir (str) – (optional) The directory to download files to.
  • retries (int) – (optional) The number of times to retry on failed requests.
  • ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.
  • fileobj (file-like object) – (optional) Write data to the given file-like object (e.g. sys.stdout).
  • return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.
  • no_change_timestamp (bool) – (optional) If True, leave the time stamp as the current time instead of changing it to that given in the original archive.
Return type:

bool

Returns:

True if file was successfully downloaded.

internetarchive.Catalog

class Catalog(archive_session, identifier=None, task_id=None, params=None, config=None, verbose=None, request_kwargs=None)

Bases: object

This class represents the Archive.org catalog. You can use this class to access tasks from the catalog.

Usage:

>>> import internetarchive
>>> c = internetarchive.Catalog(internetarchive.session.ArchiveSession(),
...                             identifier='jstor_ejc')
>>> c.tasks[-1].task_id
143919540

internetarchive.ArchiveSession

class ArchiveSession(config=None, config_file=None, debug=None, http_adapter_kwargs=None)

Bases: requests.sessions.Session

The ArchiveSession object collects together useful functionality from internetarchive as well as important data such as configuration information and credentials. It is subclassed from requests.Session.

Usage:

>>> from internetarchive import ArchiveSession
>>> s = ArchiveSession()
>>> item = s.get_item('nasa')
>>> item
Collection(identifier='nasa', exists=True)

get_item(identifier, item_metadata=None, request_kwargs=None)

A method for creating internetarchive.Item and internetarchive.Collection objects.

Parameters:
  • identifier (str) – A globally unique Archive.org identifier.
  • item_metadata (dict) – (optional) A metadata dict used to initialize the Item or Collection object. Metadata will automatically be retrieved from Archive.org if nothing is provided.
  • request_kwargs (dict) – (optional) Keyword arguments to be used in requests.sessions.Session.get() request.
get_metadata(identifier, request_kwargs=None)

Get an item’s metadata from the Metadata API.

Parameters:
  • identifier (str) – Globally unique Archive.org identifier.
Return type: dict
Returns: Metadata API response.

get_tasks(identifier=None, task_id=None, task_type=None, params=None, config=None, verbose=None, request_kwargs=None)

Get tasks from the Archive.org catalog. internetarchive must be configured with your logged-in-* cookies to use this function. If no arguments are provided, all queued tasks for the user will be returned.

Parameters:
  • identifier (str) – (optional) The Archive.org identifier for which to retrieve tasks.
  • task_id (int or str) – (optional) The task_id to retrieve from the Archive.org catalog.
  • task_type (str) – (optional) The type of tasks to retrieve from the Archive.org catalog. The types can be either “red” for failed tasks, “blue” for running tasks, “green” for pending tasks, “brown” for paused tasks, or “purple” for completed tasks.
  • params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org catalog API.
  • config (dict) – (optional) Configuration options for session.
  • verbose (bool) – (optional) Set to True to retrieve verbose information for each catalog task returned. verbose is set to True by default.
Returns:

A set of CatalogTask objects.

mount_http_adapter(protocol=None, max_retries=None, status_forcelist=None, host=None)

Mount an HTTP adapter to the ArchiveSession object.

Parameters:
  • protocol (str) – HTTP protocol to mount your adapter to (e.g. ‘https://’).
  • max_retries (int, object) – The number of times to retry a failed request. This can also be an urllib3.Retry object.
  • status_forcelist (list) – A list of status codes (as ints) to retry on.
  • host (str) – The host to mount your adapter to.
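
A minimal sketch of a call using the parameters above follows. The helper name mount_retrying_adapter and the particular status codes are assumptions chosen for illustration; the helper works with any object exposing the mount_http_adapter() signature documented here.

```python
def mount_retrying_adapter(session, max_retries=5):
    """Mount an HTTPS adapter that retries on common server-side errors."""
    session.mount_http_adapter(
        protocol="https://",                    # protocol the adapter applies to
        max_retries=max_retries,                # retry budget for failed requests
        status_forcelist=[500, 502, 503, 504],  # illustrative choice of codes
    )
    return session

# Usage (requires the internetarchive package):
#   from internetarchive import get_session
#   s = mount_retrying_adapter(get_session())
```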
search_items(query, fields=None, sorts=None, params=None, request_kwargs=None, max_retries=None)

Search for items on Archive.org.

Parameters:
  • query (str) – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
  • fields (list) – (optional) The metadata fields to return in the search results.
  • params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org Advancedsearch API.
Returns:

A Search object, yielding search results.
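
Because the returned object yields results lazily, it can be consumed with ordinary iterator tools. The sketch below collects the first few identifiers for a query. It assumes each yielded result is a dict keyed by field name, which is not stated above, and the helper name first_identifiers is hypothetical.

```python
from itertools import islice

def first_identifiers(session, query, limit=10):
    """Return up to `limit` identifiers matching `query`.

    Assumes each search result is a dict containing an 'identifier' key.
    """
    results = session.search_items(query, fields=["identifier"])
    # islice stops iterating after `limit` results instead of
    # draining the whole result set.
    return [result["identifier"] for result in islice(results, limit)]

# Usage (requires network access):
#   from internetarchive import get_session
#   first_identifiers(get_session(), 'collection:nasa', limit=5)
```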

set_file_logger(log_level, path, logger_name=u'internetarchive')

Convenience function to quickly configure any level of logging to a file.

Parameters:
  • log_level (str) – A log level as specified in the logging module.
  • path (str) – Path to the log file. The file will be created if it doesn’t already exist.
  • logger_name (str) – (optional) The name of the logger.
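
For readers who want to see what such a convenience function amounts to, here is a sketch built only on the standard logging module. It is an approximation written for illustration, not the library's source; the function name file_logger and the log format string are assumptions.

```python
import logging

def file_logger(log_level, path, logger_name="internetarchive"):
    """Attach a file handler to the named logger and return the logger."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(log_level)            # accepts names such as "DEBUG" or "INFO"
    handler = logging.FileHandler(path)   # creates the file if it does not exist
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Calling it twice with the same logger name adds a second handler, so a real helper would probably guard against duplicates.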

internetarchive.api

This module implements the Internetarchive API.

copyright: (C) 2012-2017 by Internet Archive.
license: AGPL 3, see LICENSE for more details.

configure(username=None, password=None, config_file=None)

Configure internetarchive with your Archive.org credentials.

Parameters:
  • username (str) – The email address associated with your Archive.org account.
  • password (str) – Your Archive.org password.
Usage:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password')

delete(identifier, files=None, formats=None, glob_pattern=None, cascade_delete=None, access_key=None, secret_key=None, verbose=None, debug=None, **kwargs)

Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files – (optional) Only delete files matching the given filenames.
  • formats – (optional) Only delete files matching the given formats.
  • glob_pattern (str) – (optional) Only delete files matching the given glob pattern.
  • cascade_delete (bool) – (optional) Also deletes files derived from the file, and files the file was derived from.
  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.
  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
  • verbose (bool) – (optional) Print actions to stdout.
  • debug (bool) – (optional) Set to True to print headers to stdout and exit without sending the delete request.

download(identifier, files=None, formats=None, glob_pattern=None, dry_run=None, verbose=None, silent=None, ignore_existing=None, checksum=None, destdir=None, no_directory=None, retries=None, item_index=None, ignore_errors=None, on_the_fly=None, return_responses=None, **get_item_kwargs)

Download files from an item.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files – (optional) Only download files matching the given file names.
  • formats – (optional) Only download files matching the given formats.
  • glob_pattern (str) – (optional) Only download files matching the given glob pattern.
  • dry_run (bool) – (optional) Print URLs to files to stdout rather than downloading them.
  • verbose (bool) – (optional) Turn on verbose output.
  • silent (bool) – (optional) Suppress all output.
  • ignore_existing (bool) – (optional) Skip files that already exist locally.
  • checksum (bool) – (optional) Skip downloading file based on checksum.
  • destdir (str) – (optional) The directory to download files to.
  • no_directory (bool) – (optional) Download files to current working directory rather than creating an item directory.
  • retries (int) – (optional) The number of times to retry on failed requests.
  • item_index (int) – (optional) The index of the item for displaying progress in bulk downloads.
  • ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.
  • on_the_fly (bool) – (optional) Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
  • return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.
  • **get_item_kwargs – (optional) Arguments that get_item takes.
Return type:

bool

Returns:

True if all files were downloaded successfully.

get_files(identifier, files=None, formats=None, glob_pattern=None, on_the_fly=None, **get_item_kwargs)

Get File objects from an item.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files (iterable) – (optional) Only return files matching the given filenames.
  • formats (iterable) – (optional) Only return files matching the given formats.
  • glob_pattern (str) – (optional) Only return files matching the given glob pattern.
  • on_the_fly (bool) – (optional) Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
  • **get_item_kwargs – (optional) Arguments that get_item() takes.
Usage:
>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']

get_item(identifier, config=None, config_file=None, archive_session=None, debug=None, http_adapter_kwargs=None, request_kwargs=None)

Get an Item object.

Parameters:
  • identifier (str) – The globally unique Archive.org item identifier.
  • config (dict) – (optional) A dictionary used to configure your session.
  • config_file (str) – (optional) A path to a config file used to configure your session.
  • archive_session (ArchiveSession) – (optional) An ArchiveSession object can be provided via the archive_session parameter.
  • http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
  • request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.
Usage:
>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084

get_session(config=None, config_file=None, debug=None, http_adapter_kwargs=None)

Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.

Parameters:
  • config (dict) – (optional) A dictionary used to configure your session.
  • config_file (str) – (optional) A path to a config file used to configure your session.
  • http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
Returns:

ArchiveSession object.

Usage:

>>> from internetarchive import get_session
>>> config = dict(s3=dict(access='foo', secret='bar'))
>>> s = get_session(config)
>>> s.access_key
'foo'

From the session object, you can access all of the functionality of the internetarchive lib:

>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_id=31643513)[0].server
'ia311234'

get_tasks(identifier=None, task_ids=None, task_type=None, params=None, config=None, config_file=None, verbose=None, archive_session=None, http_adapter_kwargs=None, request_kwargs=None)

Get tasks from the Archive.org catalog. internetarchive must be configured with your logged-in-* cookies to use this function. If no arguments are provided, all queued tasks for the user will be returned.

Parameters:
  • identifier (str) – (optional) The Archive.org identifier for which to retrieve tasks.
  • task_ids (int or str) – (optional) The task_ids to retrieve from the Archive.org catalog.
  • task_type (str) – (optional) The type of tasks to retrieve from the Archive.org catalog. The types can be either “red” for failed tasks, “blue” for running tasks, “green” for pending tasks, “brown” for paused tasks, or “purple” for completed tasks.
  • params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org catalog API.
  • config (dict) – (optional) Configuration options for session.
  • verbose (bool) – (optional) Set to True to retrieve verbose information for each catalog task returned. verbose is set to True by default.
Returns:

A set of CatalogTask objects.
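
The colour codes accepted by task_type are easy to mix up, so a small lookup table can help when building queries or reporting results. The mapping below restates the values listed above; the names TASK_TYPE_MEANINGS and describe_task_type are hypothetical helpers, not part of the library.

```python
# Colour codes accepted by task_type, as listed in the parameter description.
TASK_TYPE_MEANINGS = {
    "red": "failed",
    "blue": "running",
    "green": "pending",
    "brown": "paused",
    "purple": "completed",
}

def describe_task_type(task_type):
    """Return the task state a colour code stands for, or raise ValueError."""
    try:
        return TASK_TYPE_MEANINGS[task_type]
    except KeyError:
        valid = ", ".join(sorted(TASK_TYPE_MEANINGS))
        raise ValueError(
            "unknown task_type %r (expected one of: %s)" % (task_type, valid)
        ) from None
```

Validating the code before calling get_tasks() turns a typo into an immediate local error rather than an unexpected empty result.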

get_user_info(access_key, secret_key)

Returns details about an Archive.org user given an IA-S3 key pair.

Parameters:
  • access_key (str) – IA-S3 access_key to use when making the given request.
  • secret_key (str) – IA-S3 secret_key to use when making the given request.

get_username(access_key, secret_key)

Returns an Archive.org username given an IA-S3 key pair.

Parameters:
  • access_key (str) – IA-S3 access_key to use when making the given request.
  • secret_key (str) – IA-S3 secret_key to use when making the given request.

modify_metadata(identifier, metadata, target=None, append=None, append_list=None, priority=None, access_key=None, secret_key=None, debug=None, request_kwargs=None, **get_item_kwargs)

Modify the metadata of an existing item on Archive.org.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • metadata (dict) – Metadata used to update the item.
  • target (str) – (optional) The metadata target to update. Defaults to metadata.
  • append (bool) – (optional) Set to True to append metadata values to current values rather than replacing. Defaults to False.
  • append_list (bool) – (optional) Append values to an existing multi-value metadata field. No duplicate values will be added.
  • priority (int) – (optional) Set task priority.
  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.
  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
  • debug (bool) – (optional) Set to True to return a requests.Request object instead of sending request. Defaults to False.
  • **get_item_kwargs – (optional) Arguments that get_item takes.
Returns:

requests.Response object or requests.Request object if debug is True.

search_items(query, fields=None, sorts=None, params=None, archive_session=None, config=None, config_file=None, http_adapter_kwargs=None, request_kwargs=None, max_retries=None)

Search for items on Archive.org.

Parameters:
  • query (str) – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
  • fields (list) – (optional) The metadata fields to return in the search results.
  • params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org Advancedsearch API.
  • config (dict) – (optional) Configuration options for session.
  • config_file (str) – (optional) A path to a config file used to configure your session.
  • http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
  • request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.
  • max_retries (int, object) – The number of times to retry a failed request. This can also be an urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:

    >>> s = get_session()
    >>> s.mount_http_adapter()
    >>> search_results = s.search_items('nasa')

    See ArchiveSession.mount_http_adapter() for more details.

Returns:

A Search object, yielding search results.

upload(identifier, files, metadata=None, headers=None, access_key=None, secret_key=None, queue_derive=None, verbose=None, verify=None, checksum=None, delete=None, retries=None, retries_sleep=None, debug=None, request_kwargs=None, **get_item_kwargs)

Upload files to an item. The item will be created if it does not exist.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.
  • metadata (dict) – (optional) Metadata used to create a new item. If the item already exists, the metadata will not be updated – use modify_metadata.
  • headers (dict) – (optional) Add additional HTTP headers to the request.
  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.
  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
  • queue_derive (bool) – (optional) Set to False to prevent an item from being derived after upload.
  • verbose (bool) – (optional) Display upload progress.
  • verify (bool) – (optional) Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
  • checksum (bool) – (optional) Skip uploading files based on checksum.
  • delete (bool) – (optional) Delete local file after the upload has been successfully verified.
  • retries (int) – (optional) Number of times to retry the given request if S3 returns a 503 SlowDown error.
  • retries_sleep (int) – (optional) Amount of time to sleep between retries.
  • debug (bool) – (optional) Set to True to print headers to stdout, and exit without sending the upload request.
  • **get_item_kwargs – (optional) Arguments that get_item takes.
Returns:

A list of requests.Response objects.
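
The retries and retries_sleep parameters describe a retry-with-pause loop around 503 SlowDown responses. The sketch below shows that pattern in isolation under stated assumptions: send is any hypothetical zero-argument callable returning an object with a status_code attribute, and the helper name send_with_retries is illustrative. It is not the library's implementation.

```python
import time

def send_with_retries(send, retries=3, retries_sleep=30, sleep=time.sleep):
    """Call `send()` and retry while it returns HTTP 503, up to `retries` times.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    response = send()
    attempts_left = retries
    while response.status_code == 503 and attempts_left > 0:
        sleep(retries_sleep)    # wait before the next attempt
        attempts_left -= 1
        response = send()
    return response             # last response, successful or not
```

The caller still has to inspect the returned response, since the loop gives up and returns the final 503 once the retry budget is spent.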
