Concepts and terminology
A number of concepts and terms are used when discussing research data storage. This page provides an overview of some of the most common concepts and terms.
Storage types
There are many different types of storage. Some of the most common types of storage are:
Object storage
Object storage is a type of storage that stores data as objects. Each object contains data, metadata, and a unique identifier. Objects are stored in a flat address space and can be accessed using a URL. Object storage is designed to store large amounts of unstructured data. Object storage is not designed to be used as a file system. Object storage is typically accessed using a REST API. Object storage is typically used for storing data that is not actively being worked on but needs to be stored for a long time. Amazon S3 and Azure Blobs are examples of object storage services.
Block storage
Block storage is a type of storage that stores data as blocks. Each block contains data and a unique identifier. Blocks are stored in a hierarchical address space and can be accessed using a file system. Block storage is designed to store large amounts of structured data. Block storage is typically used for storing data that is actively being worked on. Amazon EBS is an example of block storage.
File storage
File storage is a type of storage that stores data as files. Your computer’s local filesystem uses file storage. Network filesystems such as NFS and SMB, and cloud storage services such as Azure Files and Amazon EFS also use file storage. File storage is designed to store large amounts of structured data. File storage is designed to be used by multiple users at the same time and supports creating, reading, updating, and deleting files. File storage is typically used for storing data that is actively being worked on. Azure Files is an example of file storage.
Sync services
Cloud-based file sync services, such as Dropbox, Google Drive, and OneDrive, offer file storage that can easily be accessed from multiple devices. These services usually offer convenient features such as file versioning, recovery, and file sharing. You can access the files stored in these services from a directory on your computer, and the files are automatically synchronized with the cloud storage service. However, the services may not support all of the features of a traditional file system (e.g., file permissions, symbolic links) and often have restrictions on the length of file names and the types of characters that can be used in file names, which may cause serious problems when using files stored in these services with some applications (e.g., FreeSurfer). Caution is advised when working with files stored in these services.
Version control systems
Version control systems, such as Git, are designed to store and manage different versions of files. Version control systems are typically used for storing source code but can also be used for storing other types of files. There are some version control systems that are designed to handle datasets, such as DVC. GitHub provides online hosting and collaboration services for Git repositories.
Data repositories
Data repositories, such as ResearchWorks Archive and OSF, are designed to store and manage research data and their metadata. Data repositories are typically used for storing data that is being shared with other researchers. Data repositories often provide features such as versioning, recovery, sharing, and embargos. Data repositories often provide a DOI for each dataset, which can be used to cite the dataset in publications.
Characteristics and considerations
There are a number of characteristics and considerations that are important when choosing a storage service, and these may be reflected in the upfront and ongoing costs. Some of the most common characteristics and considerations are:
Storage capacity
How much data is being stored. The more data that is being stored, the more it will cost to store the data. Storage services may have minimum and/or maximum storage capacity requirements. Some services may charge for storage in increments of a certain size (e.g., 1 TiB).
Storage latency
How quickly data can start being read or written. The lower the latency, the faster data can be read or written. Storage services may charge more for lower latency. A solid-state drive (SSD) has lower latency than a hard disk drive (HDD), which has lower latency than a tape drive. However, storage capacity on SSDs is more expensive than storage capacity on HDDs, which is more expensive than storage capacity on tape drives.
Data transfer
The rate at which data can be downloaded from or uploaded to a service may be limited. The higher the bandwidth, the faster data can be downloaded from or uploaded to the service. Service providers may charge more for higher bandwidth.
Data transfer costs may be charged for uploading data to and downloading data from a service, also known as ingress and egress. Typically, ingress is free and egress is charged. Some services may charge more for uploading data to and downloading data from certain locations (e.g., outside of the United States). Some providers offer options such as physical media transfer (e.g., mailing a hard drive) for uploading and downloading large amounts of data (see AWS Snowball).
Retention period
How long data needs to be stored. The longer data needs to be stored, the more it will cost to store the data. Billing for storage services is typically done on a monthly or annual basis.
Backup and recovery options
How often data is backed up, how long backups are kept, and how quickly data can be recovered from backups. The more often data is backed up, the longer backups are kept, and the faster data can be recovered from backups, the more it will cost to store the data. Backup and recovery options may not be available or involve extra charges on some services.
Frequency of access
How often data is accessed. Some services may charge for each time data is accessed or charge more for more frequent access.
Data access restrictions
Who can access the data. Data access restrictions may be inherent to a service (e.g., only people connected to the UW intranet can access the data) or may be imposed by the user, the organization, or the service provider. Sharing data with outside collaborators may require additional steps, such as creating accounts for the collaborators, or may not be possible at all.
Data security
How secure the data is against unauthorized access. While access restrictions can limit who can access the data in certain circumstances, measures such as encryption may be necessary to prevent access by users who have access to the data but should not be able to access the data (e.g., system administrators).
Research data that is subject to HIPAA or FERPA regulations must be stored in a HIPAA- or FERPA-compliant service. For more information about HIPAA at UW, see the HIPPA Guidance page.