Developer Interface

Configuration

Certain functions of the internetarchive library require your archive.org credentials (e.g. uploading, modifying metadata, searching). Your credentials and other configuration can be provided via a dictionary when instantiating an ArchiveSession or Item object, or in a config file.

The easiest way to create a config file is with the configure function:

>>> from internetarchive import configure
>>> configure('user@example.com', 'password')

Config files are stored in either $HOME/.ia or $HOME/.config/ia.ini by default. You can also specify your own path:

>>> from internetarchive import configure
>>> configure('user@example.com', 'password', config_file='/home/jake/.config/ia-alternate.ini')

Custom config files can be specified when instantiating an ArchiveSession object:

>>> from internetarchive import get_session
>>> s = get_session(config_file='/home/jake/.config/ia-alternate.ini')

Or an Item object:

>>> from internetarchive import get_item
>>> item = get_item('nasa', config_file='/home/jake/.config/ia-alternate.ini')
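The two default locations and the custom-path option described above can be combined into a small lookup helper. This is only a sketch: find_config_file is a hypothetical name, and the order in which the defaults are checked is an assumption made for illustration, not necessarily the order the library uses.

```python
import os
import tempfile

# Default locations named in the text above. The lookup order used here is an
# assumption made for illustration; the library may check them differently.
DEFAULT_CONFIG_NAMES = (os.path.join(".config", "ia.ini"), ".ia")


def find_config_file(home, explicit_path=None):
    """Return the config file to use, or None if no candidate exists."""
    if explicit_path is not None:
        return explicit_path if os.path.isfile(explicit_path) else None
    for name in DEFAULT_CONFIG_NAMES:
        candidate = os.path.join(home, name)
        if os.path.isfile(candidate):
            return candidate
    return None


# Demonstrate with a throwaway directory standing in for $HOME.
with tempfile.TemporaryDirectory() as fake_home:
    missing = find_config_file(fake_home)  # nothing written yet
    legacy_path = os.path.join(fake_home, ".ia")
    with open(legacy_path, "w") as fh:
        fh.write("[s3]\n")
    found = find_config_file(fake_home)
    found_is_legacy = found == legacy_path
```

The returned path could then be passed as config_file to get_session or get_item.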

IA-S3 Configuration

Your IA-S3 keys are required for uploading and modifying metadata. You can retrieve your IA-S3 keys at https://archive.org/account/s3.php.

They can be specified in your config file like so:

[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> c = {'s3': {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}}
>>> s = get_session(config=c)
>>> s.access_key
'mYaccEsSkEY'
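To keep keys out of source code, the same dictionary could be assembled from environment variables. The sketch below assumes two hypothetical variable names, MY_IA_ACCESS and MY_IA_SECRET, chosen for this example only; internetarchive is imported lazily so the dictionary-building part runs without the library installed.

```python
import os


def s3_config_from_env(environ=None,
                       access_var="MY_IA_ACCESS",
                       secret_var="MY_IA_SECRET"):
    """Build the nested config dict shown above from environment variables.

    The variable names are hypothetical defaults for this sketch.
    """
    environ = os.environ if environ is None else environ
    try:
        return {"s3": {"access": environ[access_var],
                       "secret": environ[secret_var]}}
    except KeyError as exc:
        raise RuntimeError(
            f"missing environment variable {exc.args[0]}") from None


def session_from_env(environ=None):
    """Create a session from the environment (requires internetarchive)."""
    from internetarchive import get_session  # imported lazily on purpose
    return get_session(config=s3_config_from_env(environ))


# A plain dict stands in for os.environ here.
example = s3_config_from_env({"MY_IA_ACCESS": "mYaccEsSkEY",
                              "MY_IA_SECRET": "mYs3cREtKEy"})
```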

Logging Configuration

You can specify logging levels and the location of your log file like so:

[logging]
level = INFO
file = /tmp/ia.log

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> c = {'logging': {'level': 'INFO', 'file': '/tmp/ia.log'}}
>>> s = get_session(config=c)

By default logging is turned off.

Other Configuration

By default, all requests are made over HTTPS in Python versions 2.7.10 or newer. You can change this setting in the general section of your config file:

[general]
secure = False

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> s = get_session(config={'general': {'secure': False}})

In the example above, all requests will be made via HTTP.
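The [s3], [logging] and [general] sections shown in this chapter can also be written by hand with the standard library, as an alternative to configure. The following is a minimal sketch: write_ia_config is a hypothetical helper, and restricting the file permissions is an extra precaution of this example rather than documented library behaviour.

```python
import configparser
import os
import tempfile


def write_ia_config(path, access, secret, log_level=None, log_file=None,
                    secure=True):
    """Write a config file with the [s3], [logging] and [general] sections."""
    # interpolation=None writes values verbatim, even if they contain '%'.
    parser = configparser.ConfigParser(interpolation=None)
    parser["s3"] = {"access": access, "secret": secret}
    if log_level and log_file:
        parser["logging"] = {"level": log_level, "file": log_file}
    if not secure:
        # Mirrors the "secure = False" setting shown above.
        parser["general"] = {"secure": "False"}
    with open(path, "w") as fh:
        parser.write(fh)
    # Assumption of this sketch: keep credentials readable by the owner only.
    os.chmod(path, 0o600)
    return path


# Write to a temporary directory and read the file back to check it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_ia_config(
        os.path.join(tmp, "ia-alternate.ini"),
        access="mYaccEsSkEY", secret="mYs3cREtKEy",
        log_level="INFO", log_file="/tmp/ia.log", secure=False,
    )
    roundtrip = configparser.ConfigParser(interpolation=None)
    roundtrip.read(config_path)
```

A file written this way could be passed to get_session or get_item through the config_file parameter.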

ArchiveSession Objects

The ArchiveSession object is subclassed from requests.Session. It collects together your credentials and config.

get_session(config=None, config_file=None, debug=None, http_adapter_kwargs=None)

Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.

Parameters:
  • config (dict) – (optional) A dictionary used to configure your session.
  • config_file (str) – (optional) A path to a config file used to configure your session.
  • http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
Returns:

ArchiveSession object.

Usage:

>>> from internetarchive import get_session
>>> config = dict(s3=dict(access='foo', secret='bar'))
>>> s = get_session(config)
>>> s.access_key
'foo'

From the session object, you can access all of the functionality of the internetarchive lib:

>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'
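Because the session persists credentials and settings, one session can serve many items. The sketch below uses only s.get_item and item.item_size from the examples above; item_sizes is a hypothetical helper name.

```python
def item_sizes(session, identifiers):
    """Map each identifier to its item_size, reusing one session for all."""
    return {identifier: session.get_item(identifier).item_size
            for identifier in identifiers}


# With the library installed and network access available, usage might be:
#
#     from internetarchive import get_session
#     s = get_session()
#     sizes = item_sizes(s, ['nasa'])
```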

Item Objects

Item objects represent Internet Archive items. From the Item object you can create new items, upload files to existing items, read and write metadata, and download or delete files.

get_item(identifier, config=None, config_file=None, archive_session=None, debug=None, http_adapter_kwargs=None, request_kwargs=None)

Get an Item object.

Parameters:
  • identifier (str) – The globally unique Archive.org item identifier.
  • config (dict) – (optional) A dictionary used to configure your session.
  • config_file (str) – (optional) A path to a config file used to configure your session.
  • archive_session (ArchiveSession) – (optional) An ArchiveSession object can be provided via the archive_session parameter.
  • http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
  • request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.
Usage:
>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084

Uploading

Uploading to an item can be done using Item.upload():

>>> item = get_item('my_item')
>>> r = item.upload('/home/user/foo.txt')

Or internetarchive.upload():

>>> from internetarchive import upload
>>> r = upload('my_item', '/home/user/foo.txt')

The item will automatically be created if it does not exist.

Refer to archive.org Identifiers for more information on creating valid archive.org identifiers.

Setting Remote Filenames

Remote filenames can be defined using a dictionary:

>>> from io import BytesIO
>>> fh = BytesIO()
>>> fh.write(b'foo bar')
>>> item.upload({'my-remote-filename.txt': fh})
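When several in-memory payloads need remote names, the dictionary can be built in one step. This is a sketch with a hypothetical helper, in_memory_files; it relies only on the dictionary form of the files argument shown above.

```python
from io import BytesIO


def in_memory_files(contents):
    """Map remote filenames to readable file-like objects.

    BytesIO(data) starts positioned at offset 0, so each buffer can be read
    from the beginning without an explicit seek(0).
    """
    files = {}
    for remote_name, data in contents.items():
        if isinstance(data, str):
            data = data.encode("utf-8")
        files[remote_name] = BytesIO(data)
    return files


files = in_memory_files({"my-remote-filename.txt": "foo bar"})
# With the library installed and configured, this could then be passed on:
#     item.upload(files)
```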

upload(identifier, files, metadata=None, headers=None, access_key=None, secret_key=None, queue_derive=None, verbose=None, verify=None, checksum=None, delete=None, retries=None, retries_sleep=None, debug=None, request_kwargs=None, **get_item_kwargs)

Upload files to an item. The item will be created if it does not exist.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.
  • metadata (dict) – (optional) Metadata used to create a new item. If the item already exists, the metadata will not be updated – use modify_metadata.
  • headers (dict) – (optional) Add additional HTTP headers to the request.
  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.
  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
  • queue_derive (bool) – (optional) Set to False to prevent an item from being derived after upload.
  • verbose (bool) – (optional) Display upload progress.
  • verify (bool) – (optional) Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
  • checksum (bool) – (optional) Skip uploading files based on checksum.
  • delete (bool) – (optional) Delete local file after the upload has been successfully verified.
  • retries (int) – (optional) Number of times to retry the given request if S3 returns a 503 SlowDown error.
  • retries_sleep (int) – (optional) Amount of time to sleep between retries.
  • debug (bool) – (optional) Set to True to print headers to stdout, and exit without sending the upload request.
  • **get_item_kwargs – (optional) Arguments that get_item takes.
Returns:

A list of requests.Response objects.
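Since upload returns a list of responses, the result can be checked after the call. The sketch below treats HTTP 200 as the success status, which is an assumption of this example; unconfirmed_uploads is a hypothetical helper, and it reads only the status_code and url attributes that requests.Response provides.

```python
def unconfirmed_uploads(responses, ok_status=200):
    """Return (url, status_code) pairs for responses that are not ok_status.

    Treating 200 as the only success status is an assumption of this sketch.
    """
    return [(response.url, response.status_code)
            for response in responses
            if response.status_code != ok_status]


# With the library installed and configured, usage might look like:
#
#     from internetarchive import upload
#     responses = upload('my_item', '/home/user/foo.txt', retries=3)
#     for url, status in unconfirmed_uploads(responses):
#         print('check', url, status)
```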

Metadata

modify_metadata(identifier, metadata, target=None, append=None, priority=None, access_key=None, secret_key=None, debug=None, request_kwargs=None, **get_item_kwargs)

Modify the metadata of an existing item on Archive.org.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • metadata (dict) – Metadata used to update the item.
  • target (str) – (optional) The metadata target to update. Defaults to metadata.
  • append (bool) – (optional) Set to True to append metadata values to current values rather than replacing them. Defaults to False.
  • priority (int) – (optional) Set task priority.
  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.
  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
  • debug (bool) – (optional) Set to True to return a requests.Request object instead of sending the request. Defaults to False.
  • **get_item_kwargs – (optional) Arguments that get_item takes.
Returns:

requests.Response object or requests.Request object if debug is True.

The default target to write to is metadata. To write to another target, such as files, use the target parameter. For example, if you had an item whose identifier was my_identifier and you wanted to add a metadata field to a file within the item called foo.txt:

>>> from internetarchive import get_files, modify_metadata
>>> r = modify_metadata('my_identifier', metadata=dict(title='My File'), target='files/foo.txt')
>>> f = list(get_files('my_identifier', 'foo.txt'))[0]
>>> f.title
'My File'

You can also create new targets if they don’t exist:

>>> r = modify_metadata('my_identifier', metadata=dict(foo='bar'), target='extra_metadata')
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> item.item_metadata['extra_metadata']
{'foo': 'bar'}
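The two kinds of target used above, a file inside the item and a named top-level target, can be produced by one small helper. This is a sketch: metadata_call is a hypothetical name, and it only assembles keyword arguments that appear in the modify_metadata signature.

```python
def metadata_call(metadata, filename=None, target=None, append=False):
    """Build keyword arguments for modify_metadata.

    filename selects a file-level target ('files/<filename>'); target names
    any other target. With neither, the default target is left untouched.
    """
    if filename is not None and target is not None:
        raise ValueError("pass either filename or target, not both")
    kwargs = {"metadata": dict(metadata), "append": append}
    if filename is not None:
        kwargs["target"] = f"files/{filename}"
    elif target is not None:
        kwargs["target"] = target
    return kwargs


file_call = metadata_call({"title": "My File"}, filename="foo.txt")
# With the library installed and configured:
#     modify_metadata('my_identifier', **file_call)
```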

Downloading

download(identifier, files=None, formats=None, glob_pattern=None, dry_run=None, verbose=None, silent=None, ignore_existing=None, checksum=None, destdir=None, no_directory=None, retries=None, item_index=None, ignore_errors=None, on_the_fly=None, return_responses=None, **get_item_kwargs)

Download files from an item.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files – (optional) Only return files matching the given file names.
  • formats – (optional) Only return files matching the given formats.
  • glob_pattern (str) – (optional) Only return files matching the given glob pattern.
  • dry_run (bool) – (optional) Print URLs to files to stdout rather than downloading them.
  • verbose (bool) – (optional) Turn on verbose output.
  • silent (bool) – (optional) Suppress all output.
  • ignore_existing (bool) – (optional) Skip files that already exist locally.
  • checksum (bool) – (optional) Skip downloading file based on checksum.
  • destdir (str) – (optional) The directory to download files to.
  • no_directory (bool) – (optional) Download files to current working directory rather than creating an item directory.
  • retries (int) – (optional) The number of times to retry on failed requests.
  • item_index (int) – (optional) The index of the item for displaying progress in bulk downloads.
  • ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.
  • on_the_fly (bool) – (optional) Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
  • return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.
  • **get_item_kwargs – (optional) Arguments that get_item takes.
Return type:

bool

Returns:

True if all files were downloaded successfully.
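A typical call combines a few of these parameters. The wrapper below is a sketch under stated assumptions: download_matching and its download_func argument are hypothetical and exist so the wrapper can be exercised without network access; by default it falls back to internetarchive.download, imported lazily.

```python
def download_matching(identifier, pattern, destdir, download_func=None):
    """Download files matching a glob pattern, skipping existing files."""
    if download_func is None:
        # Lazy import: only needed when no replacement function is supplied.
        from internetarchive import download as download_func
    return download_func(
        identifier,
        glob_pattern=pattern,   # only files matching this pattern
        destdir=destdir,        # directory to download files to
        ignore_existing=True,   # skip files that already exist locally
        retries=3,              # retry failed requests a few times
    )


# With the library installed and network access available:
#     download_matching('nasa', '*xml', '/tmp/nasa')
```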

Deleting

delete(identifier, files=None, formats=None, glob_pattern=None, cascade_delete=None, access_key=None, secret_key=None, verbose=None, debug=None, **kwargs)

Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files – (optional) Only delete files matching the given filenames.
  • formats – (optional) Only delete files matching the given formats.
  • glob_pattern (str) – (optional) Only delete files matching the given glob pattern.
  • cascade_delete (bool) – (optional) Also delete files derived from the file, and files the file was derived from.
  • access_key (str) – (optional) IA-S3 access_key to use when making the given request.
  • secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
  • verbose (bool) – (optional) Print actions to stdout.
  • debug (bool) – (optional) Set to True to print headers to stdout and exit without sending the delete request.

File Objects

get_files(identifier, files=None, formats=None, glob_pattern=None, on_the_fly=None, **get_item_kwargs)

Get File objects from an item.

Parameters:
  • identifier (str) – The globally unique Archive.org identifier for a given item.
  • files (iterable) – (optional) Only return files matching the given filenames.
  • formats (iterable) – (optional) Only return files matching the given formats.
  • glob_pattern (str) – (optional) Only return files matching the given glob pattern.
  • on_the_fly (bool) – (optional) Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
  • **get_item_kwargs – (optional) Arguments that get_item() takes.
Usage:
>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
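The File objects returned by get_files can be post-processed like any other iterable. The sketch below groups file names by extension; it relies only on the name attribute used in the example above, and names_by_extension is a hypothetical helper.

```python
import os
from collections import defaultdict


def names_by_extension(files):
    """Group file names by lower-cased extension (without the dot)."""
    groups = defaultdict(list)
    for f in files:
        extension = os.path.splitext(f.name)[1].lstrip(".").lower()
        groups[extension].append(f.name)
    return {ext: sorted(names) for ext, names in groups.items()}


# With the library installed and network access available:
#     from internetarchive import get_files
#     groups = names_by_extension(get_files('nasa'))
```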

Searching Items

search_items(query, fields=None, sorts=None, params=None, archive_session=None, config=None, config_file=None, http_adapter_kwargs=None, request_kwargs=None, max_retries=None)

Search for items on Archive.org.

Parameters:
  • query (str) – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
  • fields (list) – (optional) The metadata fields to return in the search results.
  • params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org Advancedsearch Api.
  • config (dict) – (optional) Configuration options for session.
  • config_file (str) – (optional) A path to a config file used to configure your session.
  • http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
  • request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.
  • max_retries (int, object) –

    The number of times to retry a failed request. This can also be a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:

    >>> from internetarchive import get_session
    >>> s = get_session()
    >>> s.mount_http_adapter()
    >>> search_results = s.search_items('nasa')
    

    See ArchiveSession.mount_http_adapter() for more details.

Returns:

A Search object, yielding search results.
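Because results are yielded lazily, a large result set can be truncated without fetching everything. The sketch below assumes each yielded result is a dictionary containing an identifier key; first_identifiers is a hypothetical helper.

```python
from itertools import islice


def first_identifiers(results, limit=10):
    """Return the identifiers of at most `limit` search results.

    Assumes each result is a dict with an 'identifier' key.
    """
    return [result["identifier"] for result in islice(results, limit)]


# With the library installed and network access available:
#     from internetarchive import search_items
#     ids = first_identifiers(search_items('collection:nasa'), limit=5)
```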

Internet Archive Tasks

get_tasks(identifier=None, task_ids=None, task_type=None, params=None, config=None, config_file=None, verbose=None, archive_session=None, http_adapter_kwargs=None, request_kwargs=None)

Get tasks from the Archive.org catalog. internetarchive must be configured with your logged-in-* cookies to use this function. If no arguments are provided, all queued tasks for the user will be returned.

Parameters:
  • identifier (str) – (optional) The Archive.org identifier for which to retrieve tasks.
  • task_ids (int or str) – (optional) The task_ids to retrieve from the Archive.org catalog.
  • task_type (str) – (optional) The type of tasks to retrieve from the Archive.org catalog. The types can be either “red” for failed tasks, “blue” for running tasks, “green” for pending tasks, “brown” for paused tasks, or “purple” for completed tasks.
  • params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org catalog API.
  • config (dict) – (optional) Configuration options for session.
  • verbose (bool) – (optional) Set to True to retrieve verbose information for each catalog task returned. verbose is set to True by default.
Returns:

A set of CatalogTask objects.
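The color codes accepted by task_type are easy to mix up, so a lookup from a readable state to its color can help. This sketch encodes only the mapping given in the parameter description above; task_type_for is a hypothetical helper.

```python
# Mapping taken from the task_type parameter description above.
TASK_TYPE_MEANINGS = {
    "red": "failed",
    "blue": "running",
    "green": "pending",
    "brown": "paused",
    "purple": "completed",
}

_COLOR_BY_STATE = {state: color for color, state in TASK_TYPE_MEANINGS.items()}


def task_type_for(state):
    """Return the task_type color for a human-readable task state."""
    try:
        return _COLOR_BY_STATE[state.lower()]
    except KeyError:
        known = ", ".join(sorted(_COLOR_BY_STATE))
        raise ValueError(
            f"unknown task state {state!r}; expected one of: {known}") from None


# With the library installed and configured:
#     from internetarchive import get_tasks
#     failed = get_tasks(identifier='nasa', task_type=task_type_for('failed'))
```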