ORBITAAL: cOmpRehensive BItcoin daTaset for temporAl grAph anaLysis

Dataset Construction

This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest available time resolution, expressed as UNIX timestamps. It is built from the blockchain over the period from 3 January 2009 to 25 January 2021. The blockchain was extracted using the bitcoin-etl Python package (https://github.com/blockchain-etl/bitcoin-etl). The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1], together with the addresses of popular Bitcoin users provided by https://www.walletexplorer.com/

[1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.
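The common-input heuristic groups addresses that appear together as inputs of the same transaction, on the assumption that they are controlled by one entity. The sketch below illustrates the idea with a union-find structure. It is not the pipeline used to build this dataset: the input layout (one list of input addresses per transaction) and the names UnionFind and cluster_addresses are assumptions made for illustration.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_addresses(transactions):
    """Group addresses that co-occur as inputs of a transaction.

    `transactions` is a hypothetical layout: an iterable of lists,
    each holding the input addresses of one transaction.
    """
    uf = UnionFind()
    for inputs in transactions:
        if not inputs:
            continue
        first = inputs[0]
        uf.find(first)  # make sure single-input addresses are registered
        for address in inputs[1:]:
            uf.union(first, address)

    clusters = defaultdict(set)
    for address in list(uf.parent):
        clusters[uf.find(address)].add(address)
    return list(clusters.values())


# Toy example: a1, a2 and a3 end up in the same entity because
# a2 is spent together with a1 in one transaction and with a3 in another.
example = [["a1", "a2"], ["a2", "a3"], ["b1"], ["c1", "c2"]]
entities = cluster_addresses(example)
```

Each resulting set plays the role of one entity (node) in the entity-entity network.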

Dataset Description

Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021

Overview:

This dataset provides a comprehensive representation of Bitcoin exchanges between entities, from the inception of Bitcoin to January 2021. It offers several temporal resolutions and representations to support the analysis of the Bitcoin transaction network as a temporal graph.

All dates are derived from block UNIX timestamps and expressed in the GMT timezone.
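As a small illustration of this convention, a block UNIX timestamp can be converted to a GMT (UTC) date with the Python standard library. The function name block_time_to_gmt is hypothetical, and the value 1231006505 is the widely documented timestamp of the Bitcoin genesis block, used here only as an example input rather than a value read from the dataset files.

```python
from datetime import datetime, timezone


def block_time_to_gmt(unix_timestamp):
    """Convert a block UNIX timestamp (seconds) to a timezone-aware GMT/UTC datetime."""
    return datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)


# Timestamp of the Bitcoin genesis block (example input).
genesis = block_time_to_gmt(1231006505)
print(genesis.isoformat())  # 2009-01-03T18:15:05+00:00
```

Passing tz=timezone.utc explicitly avoids any dependence on the local timezone of the machine running the code.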

Contents:

The dataset is distributed across several compressed archives, described below:

All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries. The files can be read, for example, with the pyspark Python package.
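A minimal sketch of loading the stream graph edges with PySpark is given below. It assumes the stream graph archive has already been extracted under a local directory (extract_dir, a hypothetical argument) and relies on the STREAM_GRAPH/EDGES/ layout described in the archive listing. The helper names are illustrative, and the column names are deliberately not assumed: the schema is printed instead.

```python
def stream_graph_glob(extract_dir):
    """Glob pattern matching every stream graph Parquet file under extract_dir."""
    return f"{extract_dir.rstrip('/')}/STREAM_GRAPH/EDGES/*.snappy.parquet"


def load_stream_graph(extract_dir):
    """Load all stream graph edge files into one Spark DataFrame."""
    # Imported lazily so the path helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orbitaal-example").getOrCreate()
    edges = spark.read.parquet(stream_graph_glob(extract_dir))
    edges.printSchema()  # inspect the actual column names and types
    return edges
```

A call such as load_stream_graph("/data/orbitaal") would then return a DataFrame covering all yearly files, provided the files are present at that location.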

orbitaal-stream_graph.tar.gz:

The root directory is STREAM_GRAPH/

Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes).

The stream graph is divided into 13 files, one for each year.

File format is Parquet.

Name format is orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] gives the same order as sorting by increasing year.

These files are in the subdirectory STREAM_GRAPH/EDGES/
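The naming scheme above can be parsed with a regular expression, for instance to order files chronologically by [ID]. This is a sketch under the stated name format only: the function names are hypothetical, the example file names are constructed for illustration, and whether [ID] is zero-padded is not specified, so the pattern accepts any number of digits.

```python
import re

STREAM_FILE_RE = re.compile(
    r"^orbitaal-stream_graph-date-(?P<year>\d{4})"
    r"-file-id-(?P<file_id>\d+)\.snappy\.parquet$"
)


def parse_stream_filename(name):
    """Return (year, file_id) extracted from a stream graph file name."""
    match = STREAM_FILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a stream graph file name: {name!r}")
    return int(match["year"]), int(match["file_id"])


def sort_chronologically(names):
    """Sort file names by their numeric [ID], which follows the year order."""
    return sorted(names, key=lambda name: parse_stream_filename(name)[1])


# Illustrative names built from the documented pattern.
names = [
    "orbitaal-stream_graph-date-2010-file-id-2.snappy.parquet",
    "orbitaal-stream_graph-date-2021-file-id-13.snappy.parquet",
    "orbitaal-stream_graph-date-2009-file-id-1.snappy.parquet",
]
ordered_years = [parse_stream_filename(n)[0] for n in sort_chronologically(names)]
```

Sorting on the integer value of [ID] rather than on the raw string avoids lexicographic pitfalls such as "13" sorting before "2".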

orbitaal-snapshot-all.tar.gz:

The root directory is SNAPSHOT/

Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021).

File format is Parquet.

File name is orbitaal-snapshot-all.snappy.parquet.

This file is in the subdirectory SNAPSHOT/EDGES/ALL/
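To make the relationship between the stream graph and an aggregated snapshot concrete, the sketch below collapses timestamped edges into one weighted edge per (source, destination) pair. The record layout (timestamp, source, destination, value) and the choice to sum values are assumptions made for illustration; the actual columns and aggregation rules should be checked against the Parquet schema.

```python
from collections import defaultdict


def aggregate_snapshot(stream_edges):
    """Collapse timestamped edges into one weighted edge per (src, dst) pair.

    `stream_edges` is a hypothetical layout: an iterable of
    (timestamp, src, dst, value) tuples with integer values.
    """
    totals = defaultdict(int)
    for _timestamp, src, dst, value in stream_edges:
        totals[(src, dst)] += value
    return dict(totals)


# Toy stream graph with made-up entity identifiers and values.
toy_stream = [
    (1468022400, 1, 2, 50),
    (1468026000, 1, 2, 25),
    (1468029600, 2, 3, 10),
]
toy_snapshot = aggregate_snapshot(toy_stream)
```

Restricting the input to the edges of a given year, month, day, or hour before aggregating would give the corresponding finer-grained snapshot under the same assumptions.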

orbitaal-snapshot-year.tar.gz:

The root directory is SNAPSHOT/

Contains the snapshot networks at yearly resolution.

File format is Parquet.

Name format is orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] gives the same order as sorting by increasing year.

These files are in the subdirectory SNAPSHOT/EDGES/year/

orbitaal-snapshot-month.tar.gz:

The root directory is SNAPSHOT/

Contains the snapshot networks at monthly resolution.

File format is Parquet.

Name format is orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stand for the corresponding year and month, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] gives the same order as sorting by increasing year and month.

These files are in the subdirectory SNAPSHOT/EDGES/month/

orbitaal-snapshot-day.tar.gz:

The root directory is SNAPSHOT/

Contains the snapshot networks at daily resolution.

File format is Parquet.

Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] gives the same order as sorting by increasing year, month, and day.

These files are in the subdirectory SNAPSHOT/EDGES/day/

orbitaal-snapshot-hour.tar.gz:

The root directory is SNAPSHOT/

Contains the snapshot networks at hourly resolution.

File format is Parquet.

Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] gives the same order as sorting by increasing year, month, day, and hour.

These files are in the subdirectory SNAPSHOT/EDGES/hour/
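The yearly, monthly, daily, and hourly snapshot names share one pattern that differs only in the number of date fields, so a single parser can recover the resolution from a file name. This sketch follows the documented name formats; the function name and example file names are hypothetical, and because zero-padding of [MM], [DD], [hh], and [ID] is not specified, the pattern accepts one or two digits for date fields and any number of digits for [ID].

```python
import re

SNAPSHOT_FILE_RE = re.compile(
    r"^orbitaal-snapshot-date-(?P<year>\d{4})"
    r"(?:-(?P<month>\d{1,2}))?"
    r"(?:-(?P<day>\d{1,2}))?"
    r"(?:-(?P<hour>\d{1,2}))?"
    r"-file-id-(?P<file_id>\d+)\.snappy\.parquet$"
)

RESOLUTIONS = ["year", "month", "day", "hour"]


def parse_snapshot_filename(name):
    """Return the resolution, date fields, and [ID] encoded in a snapshot file name."""
    match = SNAPSHOT_FILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a dated snapshot file name: {name!r}")
    # Keep only the date fields that are present, in year-month-day-hour order.
    fields = [int(match[key]) for key in RESOLUTIONS if match[key] is not None]
    return {
        "resolution": RESOLUTIONS[len(fields) - 1],
        "date": tuple(fields),
        "file_id": int(match["file_id"]),
    }


# Illustrative name built from the documented daily pattern.
info = parse_snapshot_filename(
    "orbitaal-snapshot-date-2016-07-09-file-id-100.snappy.parquet"
)
```

The aggregated file orbitaal-snapshot-all.snappy.parquet carries no date field and is intentionally rejected by this pattern.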

orbitaal-nodetable.tar.gz:

The root directory is NODE_TABLE/

Contains two files in Parquet format. The first gives information about the nodes present in the stream graphs and snapshots, such as their period of activity and associated global Bitcoin balance. The second contains the list of all associated Bitcoin addresses.

Small samples in CSV format

orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv

These two CSV files contain stream graph representations of the days around the Bitcoin halving that occurred in 2016.

orbitaal-snapshot-2016_07_08.csv and orbitaal-snapshot-2016_07_09.csv

These two CSV files contain daily snapshot representations of the days around the Bitcoin halving that occurred in 2016.

Additional Info

Source: https://data.niaid.nih.gov/resources?id=zenodo_10844224
Last Updated: December 4, 2024, 20:17 (UTC)
Created: December 4, 2024, 20:16 (UTC)