the 4Vs of data
volume, velocity, variety, veracity
(how much data is stored, how fast can it be accessed, what kind of data is stored, what is the quality and accuracy of the data)
volume of big data
the size of the data that is stored and available for access and processing
max. handling capacity of a single server in 2025
up to around one Petabyte
(anything above this needs to be stored on a distributed system)
abbreviation and storage space of Kilobyte
KB, 1000 B, 8000 bits
1 kibibyte (KiB) is 1024 B (2¹⁰)
units of data volume from and above gigabyte
gigabyte, terabyte, petabyte, exabyte, zettabyte, yottabyte (in ascending order with steps ×1000; the binary units GiB, TiB, … step by ×1024)
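The ×1000 stepping can be sketched with a small helper. `human_size` is a hypothetical function name, not from any library; it formats a byte count using the decimal (SI) units listed above.

```python
# Hypothetical helper: format a byte count using decimal (SI) units,
# stepping by x1000 per unit (KB, MB, GB, TB, PB, EB, ZB, YB).
def human_size(n_bytes: int) -> str:
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
    size = float(n_bytes)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{size:g} {unit}"
        size /= 1000

print(human_size(1500))    # 1.5 KB
print(human_size(10**15))  # 1 PB
```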
transfer speed (definition, unit)
a measure of how much data is transferred per unit of time, e.g. per second; b/s (bps), Kb/s (Kbps), Mb/s (Mbps), Gb/s (Gbps)
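A quick worked example of the definition, with assumed figures (1 TB of data over a 1 Gb/s link, decimal units throughout):

```python
# Transfer time = data size / transfer speed.
size_bits = 1 * 1000**4 * 8  # 1 TB expressed in bits
speed_bps = 1 * 1000**3      # 1 Gb/s in bits per second

seconds = size_bits / speed_bps
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")  # 8000 s (~2.2 h)
```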
What can we do when the slowest component of the server (usually the disk) reaches its transfer speed limit?
We can accelerate the workload by implementing a distributed system, so we can have data written to many servers across a server pool.
response time (definition, unit)
the time it takes for a database to respond to an access or storage request, ms
What is the use of messages in big data?
They are used for transmitting data in low-velocity IoT applications before it is ingested into DB systems.
(Messaging systems come with a queue to offer support for scenarios with unstable internet connection. Newly accumulated data is added to the queue, where it is stored until the device comes back online.)
FIFO
first-in, first-out methodology
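A minimal FIFO sketch using Python's `collections.deque`; the message strings are made-up placeholders:

```python
from collections import deque

# FIFO: the first message enqueued is the first one dequeued,
# which is how messaging queues buffer data for later ingestion.
queue = deque()
for msg in ["reading-1", "reading-2", "reading-3"]:
    queue.append(msg)        # enqueue at the tail

first = queue.popleft()      # dequeue from the head
print(first)                 # reading-1
```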
eventual consistency
A concept of improving DB read and write performance. When it is applied, the data is initially only written to one node, then replicated across others.
<-> strong consistency
-> Data can’t be assumed to be up to date in a distributed DB.
What does variety describe in big data?
the different types of data present: structured, semi-structured, unstructured
What kind of data can cause veracity issues in the field of big data?
inconsistent, untrusted, raw/uncleansed, biased, incomplete etc.
What dimensions do extensions to the 4Vs contain?
variability, exhaustivity, fine-grained, relationality, resolution & indexicality, extensionality & scalability, value
data mining in big data
the process of finding, extracting and processing data
psycopg
the most popular PostgreSQL adapter for the Python programming language
tweepy
a Python library that enables easy Twitter API access
Twitter OAuthing in Python
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
After the first line, the auth variable refers to an instance of the class tweepy.OAuthHandler.
What can we use the api object resulting from api = tweepy.API(auth) in Python for?
to authenticate search requests against the API
How do I paginate through tweets by hashtag and start date using tweepy v2?
from datetime import datetime, timezone
import os, tweepy, mysql.connector
# ---------- 1. Authenticate -----------------------------------------
client = tweepy.Client(
    bearer_token=os.environ["X_BEARER_TOKEN"],
    wait_on_rate_limit=True,
)
# ---------- 2. Build query ------------------------------------------
query = "#InterestingHashtag lang:en -is:retweet"
# Earliest date wanted, inclusive (UTC)
start_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
# ---------- 3. Choose endpoint by access tier -----------------------
# search_all_tweets    → Full Archive (Academic track; effectively dead)
# search_recent_tweets → Last 7 days (Basic $100/mo+, max 100/page)
search_method = client.search_recent_tweets
# ---------- 4. Iterate with Paginator -------------------------------
db_params = dict(host="localhost", database="mydb",
                 user="root", password="secret")
sql = "INSERT IGNORE INTO tweets(id, text) VALUES (%s, %s)"
BATCH = 200
cxn = mysql.connector.connect(**db_params)
cur = cxn.cursor()
try:
    paginator = tweepy.Paginator(
        search_method,
        query=query,
        start_time=start_time.isoformat(),  # YYYY-MM-DDTHH:MM:SS+00:00
        tweet_fields=["id", "text"],
        max_results=100,  # 10-100 for recent search
    )
    count = 0
    for tweet in paginator.flatten(limit=None):
        cur.execute(sql, (tweet.id, tweet.text))
        count += 1
        if count % BATCH == 0:
            cxn.commit()
    cxn.commit()  # flush remainder
finally:
    cur.close()
    cxn.close()
pub/sub
a system of publishers (IoT edge devices) and subscribers (brokers that make data available to clients)
What does a message broker do in Industry 4.0?
It handles delivery of messages after periods of downtime on the subscriber’s end.
What are the components of the pub/sub pattern?
IoT machinery -> message broker -> DB
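The pattern above can be sketched as a toy in-memory broker (class and topic names are invented for illustration, not a real messaging library). Each subscriber gets its own FIFO queue, so messages published while a subscriber is offline wait until it drains them:

```python
from collections import defaultdict, deque

class ToyBroker:
    def __init__(self):
        self.queues = defaultdict(dict)   # topic -> {subscriber: deque}

    def subscribe(self, topic, subscriber):
        self.queues[topic][subscriber] = deque()

    def publish(self, topic, message):
        for q in self.queues[topic].values():
            q.append(message)             # buffered per subscriber

    def drain(self, topic, subscriber):
        q = self.queues[topic][subscriber]
        msgs = list(q)
        q.clear()
        return msgs

broker = ToyBroker()
broker.subscribe("sensors/temp", "db-writer")
broker.publish("sensors/temp", 21.5)   # subscriber offline: buffered
broker.publish("sensors/temp", 22.0)
print(broker.drain("sensors/temp", "db-writer"))   # [21.5, 22.0]
```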
SWIFT
Society for Worldwide Interbank Financial Telecommunication
(payment processing system acting as a messaging service)