Dataset Construction
This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest time resolution, expressed as UNIX timestamps. It is built from the blockchain covering the period from 3 January 2009 to 25 January 2021. The blockchain was extracted with the bitcoin-etl Python package (https://github.com/blockchain-etl/bitcoin-etl). The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1], together with the addresses of popular Bitcoin users provided by https://www.walletexplorer.com/
[1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.
Dataset Description
Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021
Overview:
This dataset provides a comprehensive representation of Bitcoin exchanges between entities, from the inception of Bitcoin in January 2009 to January 2021. It includes several temporal resolutions and representations to support the analysis of the Bitcoin transaction network as a temporal graph.
All dates are derived from block UNIX timestamps and expressed in the GMT timezone.
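To illustrate how a block timestamp maps to the calendar units used by the snapshots, here is a minimal Python sketch that converts a UNIX timestamp to a GMT/UTC datetime and truncates it to a chosen resolution. The function names are hypothetical, and the example timestamp (the commonly quoted timestamp of the first Bitcoin block) is an assumption used only for illustration; neither comes from the dataset tooling.

```python
from datetime import datetime, timezone


def to_gmt(unix_ts: int) -> datetime:
    """Convert a UNIX timestamp (in seconds) to a timezone-aware GMT/UTC datetime."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc)


def truncate(dt: datetime, resolution: str) -> datetime:
    """Truncate a datetime to the start of its year, month, day, or hour."""
    if resolution == "year":
        return dt.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if resolution == "month":
        return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if resolution == "day":
        return dt.replace(hour=0, minute=0, second=0, microsecond=0)
    if resolution == "hour":
        return dt.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown resolution: {resolution!r}")


# Example value (assumption): timestamp of the first Bitcoin block.
ts = 1231006505
dt = to_gmt(ts)
print(dt.isoformat())                    # 2009-01-03T18:15:05+00:00
print(truncate(dt, "hour").isoformat())  # 2009-01-03T18:00:00+00:00
```

Using timezone-aware datetimes avoids silently shifting timestamps into the local timezone of the machine running the analysis.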
Contents:
The dataset is distributed across seven compressed archives, described below.
All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries. The files can be read, for example, with the pyspark Python package.
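As a minimal sketch of reading the Parquet files with pyspark, the helper below loads every Parquet file under a directory into one Spark DataFrame. The function name and the application name are hypothetical, and the example assumes that pyspark is installed and that an archive has already been extracted locally. No column names are assumed; inspect the schema first.

```python
def load_parquet_dir(path: str):
    """Load every Parquet file under `path` into a single Spark DataFrame."""
    # Imported lazily so this module can be imported without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orbitaal-example").getOrCreate()
    return spark.read.parquet(path)


# Example usage (requires pyspark and an extracted archive):
# df = load_parquet_dir("STREAM_GRAPH/EDGES/")
# df.printSchema()   # inspect the column names and types
# print(df.count())  # number of rows
```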
orbitaal-stream_graph.tar.gz:
The root directory is STREAM_GRAPH/
Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes).
The stream graph is divided into 13 files, one for each year.
File format is Parquet.
Name format is orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] is equivalent to sorting by increasing year.
These files are in the subdirectory STREAM_GRAPH/EDGES/
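The naming scheme above can be parsed with a short regular expression. The following is a sketch under the assumption that [YYYY] is a four-digit year and [ID] is a plain decimal integer; the helper names and the two example file names (including their IDs) are illustrative, not taken from the archive listing.

```python
import re
from typing import NamedTuple

# Pattern for orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet
STREAM_RE = re.compile(
    r"^orbitaal-stream_graph-date-(\d{4})-file-id-(\d+)\.snappy\.parquet$"
)


class StreamFile(NamedTuple):
    year: int
    file_id: int


def parse_stream_name(name: str) -> StreamFile:
    """Extract the year and the file ID from a stream graph file name."""
    match = STREAM_RE.match(name)
    if match is None:
        raise ValueError(f"not a stream graph file name: {name!r}")
    return StreamFile(year=int(match.group(1)), file_id=int(match.group(2)))


# Illustrative names following the documented format (IDs are assumptions).
names = [
    "orbitaal-stream_graph-date-2010-file-id-2.snappy.parquet",
    "orbitaal-stream_graph-date-2009-file-id-1.snappy.parquet",
]

# Sort numerically by ID; a plain string sort would place "10" before "2".
ordered = sorted(names, key=lambda n: parse_stream_name(n).file_id)
print(ordered[0])
```

Sorting on the parsed integer rather than on the raw file name keeps the chronological order intact once IDs reach two digits.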
orbitaal-snapshot-all.tar.gz:
The root directory is SNAPSHOT/
Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021).
File format is Parquet.
Name format is orbitaal-snapshot-all.snappy.parquet.
These files are in the subdirectory SNAPSHOT/EDGES/ALL/
orbitaal-snapshot-year.tar.gz:
The root directory is SNAPSHOT/
Contains the snapshot networks at yearly resolution.
File format is Parquet.
Name format is orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] is equivalent to sorting by increasing year.
These files are in the subdirectory SNAPSHOT/EDGES/year/
orbitaal-snapshot-month.tar.gz:
The root directory is SNAPSHOT/
Contains the snapshot networks at monthly resolution.
File format is Parquet.
Name format is orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stand for the corresponding year and month, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] is equivalent to sorting by increasing year and month.
These files are in the subdirectory SNAPSHOT/EDGES/month/
orbitaal-snapshot-day.tar.gz:
The root directory is SNAPSHOT/
Contains the snapshot networks at daily resolution.
File format is Parquet.
Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] is equivalent to sorting by increasing year, month, and day.
These files are in the subdirectory SNAPSHOT/EDGES/day/
orbitaal-snapshot-hour.tar.gz:
The root directory is SNAPSHOT/
Contains the snapshot networks at hourly resolution.
File format is Parquet.
Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N (the number of files), such that sorting by increasing [ID] is equivalent to sorting by increasing year, month, day, and hour.
These files are in the subdirectory SNAPSHOT/EDGES/hour/
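Because the [ID] of a snapshot file is not known in advance, one way to locate the file for a given period is to build a glob pattern from the documented naming scheme. The sketch below does this for the yearly, monthly, daily, and hourly snapshots. The helper name is hypothetical, and it assumes that [MM], [DD], and [hh] are zero-padded to two digits, which the placeholders suggest but the description does not state explicitly.

```python
from datetime import datetime

# Subdirectory and date format for each snapshot resolution described above.
# Zero-padded month, day, and hour fields are an assumption.
RESOLUTIONS = {
    "year": ("SNAPSHOT/EDGES/year/", "%Y"),
    "month": ("SNAPSHOT/EDGES/month/", "%Y-%m"),
    "day": ("SNAPSHOT/EDGES/day/", "%Y-%m-%d"),
    "hour": ("SNAPSHOT/EDGES/hour/", "%Y-%m-%d-%H"),
}


def snapshot_glob(resolution: str, when: datetime) -> str:
    """Build a glob pattern matching the snapshot file(s) for a given period."""
    subdir, date_format = RESOLUTIONS[resolution]
    date_part = when.strftime(date_format)
    # The file ID is unknown in advance, so it is left as a wildcard.
    return f"{subdir}orbitaal-snapshot-date-{date_part}-file-id-*.snappy.parquet"


print(snapshot_glob("day", datetime(2016, 7, 9)))
# SNAPSHOT/EDGES/day/orbitaal-snapshot-date-2016-07-09-file-id-*.snappy.parquet
```

The resulting pattern can be passed to `glob.glob` from the Python standard library, relative to the directory where the archive was extracted.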
orbitaal-nodetable.tar.gz:
The root directory is NODE_TABLE/
Contains two files in Parquet format. The first gives information about the nodes present in the stream graphs and snapshots, such as their period of activity and associated global Bitcoin balance. The second contains the list of all associated Bitcoin addresses.
Small samples in CSV format
orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv
These two CSV files contain stream graph representations for the period of the Bitcoin halving that took place in 2016.
orbitaal-snapshot-2016_07_08.csv and orbitaal-snapshot-2016_07_09.csv
These two CSV files contain daily snapshot representations for the period of the Bitcoin halving that took place in 2016.
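The CSV samples can be inspected without any third-party package. The sketch below returns the column names and the first few rows of a CSV file using only the Python standard library. The function name is hypothetical, and the code assumes a comma-separated file with a header row and UTF-8 encoding, which should be checked against the actual samples.

```python
import csv
from itertools import islice


def preview_csv(path: str, n: int = 5):
    """Return the column names and the first `n` rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n))  # read at most n data rows
        return reader.fieldnames, rows


# Example usage (requires the sample file in the working directory):
# columns, rows = preview_csv("orbitaal-snapshot-2016_07_08.csv")
# print(columns)
# for row in rows:
#     print(row)
```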