In this section, we will clean the aggregated WiFi data to ensure it is suitable for analysis. We will start by loading the aggregated data from CSV files and combining them, and then apply filters to remove randomized MAC addresses and non-mobile devices.
8.1 Loading and Combining Aggregated Data
We will load the aggregated data from the previously created CSV files and combine them into a single dataset.
pacman::p_load(data.table, lubridate)

# List all CSV files in the ch3 folder that contain "1second" in their filenames
csv_files <- list.files("material/ch3", pattern = "1second.*\\.csv$", full.names = TRUE)

# Load aggregated data from the CSV files and combine them
wf_combined <- rbindlist(lapply(csv_files, fread))

# Print the first few rows to verify
head(wf_combined, 5)
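When several files are combined, it can be useful to record which file each row came from. The following is a minimal sketch using two temporary CSV files with made-up contents; the folder, the file names, the toy columns, and the `source_file` column are assumptions for illustration and are not part of the tutorial dataset.

```r
library(data.table)

# Create two small CSV files in a temporary folder (toy data, illustration only)
tmp_dir <- file.path(tempdir(), "wifi_demo")
dir.create(tmp_dir, showWarnings = FALSE)
fwrite(data.table(source_address = c("aa", "bb"), sensor_name = "s1"),
       file.path(tmp_dir, "demo_1second_a.csv"))
fwrite(data.table(source_address = "cc", sensor_name = "s2"),
       file.path(tmp_dir, "demo_1second_b.csv"))

# Same listing pattern as in the text
demo_files <- list.files(tmp_dir, pattern = "1second.*\\.csv$", full.names = TRUE)

# Name each list element by its file name so that rbindlist() can store
# the origin of every row in a new column
demo_list <- lapply(demo_files, fread)
names(demo_list) <- basename(demo_files)
demo_combined <- rbindlist(demo_list, use.names = TRUE, idcol = "source_file")

print(demo_combined)
```

Setting `use.names = TRUE` binds columns by name rather than by position, which guards against files whose columns are in a different order.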
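8.2 Data Filtering
8.2.1 Removal of Random MAC Addresses
Many devices broadcast randomized MAC addresses that change over time, so the same device can be counted more than once and its detections cannot be linked reliably. We will remove all entries flagged as randomized, that is, rows where source_address_randomized is 1.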
# Remove entries with random MAC addresses
wf_filtered_1 <- wf_combined[source_address_randomized != 1]
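How the source_address_randomized flag was derived is not stated in this section. As background, randomized addresses are locally administered addresses, which have the second-least-significant bit of the first octet set. The sketch below checks that bit; the helper name `is_locally_administered` and the example addresses are hypothetical, and it assumes addresses are written as hex octets such as "da:a1:19:00:00:01".

```r
# Returns TRUE when the locally administered bit (0x02 of the first octet) is set.
# Hypothetical helper for illustration; assumes MACs start with two hex digits.
is_locally_administered <- function(mac) {
  first_octet <- strtoi(substr(mac, 1, 2), base = 16L)
  bitwAnd(first_octet, 2L) == 2L
}

# Made-up example addresses
macs <- c("da:a1:19:00:00:01", "00:1a:2b:3c:4d:5e", "3c:5a:b4:00:00:01")
print(is_locally_administered(macs))
```

A check like this can serve as a sanity test on the flag, but the filter in the text relies only on the source_address_randomized column.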
8.2.2 Removal of Non-Mobile Devices
We will filter out non-mobile devices from our dataset. Non-mobile devices typically exhibit characteristics such as being detected by fewer sensors and being present in the same location for extended periods. We will use two main criteria to identify and remove these non-mobile devices:
Detected by fewer sensors: Devices detected by a small number of sensors are likely to be non-mobile.
Detected for long durations: Devices detected for long periods in the same location are also likely to be non-mobile.
The following steps outline the process of identifying and removing non-mobile devices using these criteria.
Step 1: Define Filtering Thresholds
Define the thresholds for filtering. You can adjust these values based on your dataset and analysis needs.
# Define thresholds for filtering non-mobile devices
sensor_threshold_percentile <- 0.3          # Devices detected by fewer sensors: the 30th percentile
continuous_detection_threshold <- 300       # 5 minutes (300 seconds)
total_detection_time_threshold <- 3600 * 2  # Total 2 hours (7200 seconds)
Step 2: Calculate the Number of Unique Sensors
Calculate the number of unique sensors that detected each device.
# Calculate the number of unique sensors detecting each device
device_sensor_counts <- wf_filtered_1[, .(sensor_count = uniqueN(sensor_name)), by = source_address]
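As a small illustration of this grouping step, the toy table below (made-up addresses and sensor names) shows that repeated detections by the same sensor are counted once per device.

```r
library(data.table)

# Toy detections: device "aa" is seen by two sensors, "bb" by one
toy <- data.table(
  source_address = c("aa", "aa", "aa", "bb", "bb"),
  sensor_name    = c("s1", "s1", "s2", "s1", "s1")
)

# Count distinct sensors per device
toy_counts <- toy[, .(sensor_count = uniqueN(sensor_name)), by = source_address]
print(toy_counts)
```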
Step 3: Identify Devices Detected by Fewer Sensors
Identify devices that are detected by a small number of sensors, at or below the defined percentile threshold. These devices are the candidates for the non-mobile check in the following steps.
# Identify devices detected by fewer sensors (at or below the specified percentile)
low_detection_threshold <- quantile(device_sensor_counts$sensor_count, sensor_threshold_percentile)
low_detection_devices <- device_sensor_counts[sensor_count <= low_detection_threshold, source_address]

# Keep only these low-sensor devices as candidates for the non-mobile check.
# Note: this overwrites wf_filtered_1, so the following steps see the candidate subset only.
wf_filtered_1 <- wf_filtered_1[source_address %in% low_detection_devices]
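Because sensor counts are small integers with many ties, the share of devices at or below the 30th percentile can be well above 30%. The sketch below uses made-up counts to show this; the numbers are illustrative and not taken from the dataset.

```r
# Made-up sensor counts for six devices
sensor_count <- c(1, 1, 1, 2, 3, 5)

# 30th percentile using R's default quantile type
threshold <- quantile(sensor_count, 0.3)
print(threshold)

# Share of devices selected by the "<=" comparison
share_selected <- mean(sensor_count <= threshold)
print(share_selected)
```

Here the threshold is 1 and half of the devices are selected. In the extreme case where every device is seen by a single sensor, all devices become candidates.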
Step 4: Calculate Time Differences Between Detections
Calculate the time difference between consecutive detections for each device.
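# Sort by device and time so that each lag refers to the previous detection
# (rows combined from several CSV files are not guaranteed to be in time order)
setorder(wf_filtered_1, source_address, timestamp)

# Calculate the time difference between consecutive detections
wf_filtered_1[, time_diff := as.numeric(difftime(timestamp, shift(timestamp, type = "lag"), units = "secs")), by = .(source_address)]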
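The lag is only meaningful when each device's rows are in time order. The toy example below (made-up addresses and timestamps, deliberately unsorted) sorts the rows first and then computes the gaps; the first detection of each device has no predecessor and therefore gets NA.

```r
library(data.table)

# Toy detections, deliberately out of order
toy <- data.table(
  source_address = c("aa", "bb", "aa", "aa"),
  timestamp = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + c(120, 0, 0, 60)
)

# Sort by device and time, then take the gap to the previous detection
setorder(toy, source_address, timestamp)
toy[, time_diff := as.numeric(difftime(timestamp, shift(timestamp, type = "lag"), units = "secs")),
    by = source_address]

print(toy)
```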
Step 5: Identify Continuous Detection Periods
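Identify periods of continuous detection based on the defined threshold. A new period starts at a device's first detection, or whenever the gap since its previous detection exceeds continuous_detection_threshold.
# Assign a period ID that increases by one each time a new period starts
wf_filtered_1[, continuous_detection := cumsum(time_diff > continuous_detection_threshold | is.na(time_diff)), by = .(source_address)]
Step 6: Calculate Total Continuous Detection Time
Calculate the total continuous detection time for each period.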
# Calculate total continuous detection time for each period
wf_filtered_1[, total_continuous_time := sum(time_diff[time_diff <= continuous_detection_threshold], na.rm = TRUE),
              by = .(sensor_name, source_address, continuous_detection)]
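To make the period logic concrete, here is a small worked example on made-up data: one device seen by one sensor at 0, 60, 120, 1000 and 1060 seconds, with a 300-second gap threshold. The variable `gap_threshold` and the toy values are assumptions for illustration.

```r
library(data.table)

gap_threshold <- 300  # seconds; plays the role of continuous_detection_threshold

# One device, one sensor, five detections with a long gap before the fourth
toy <- data.table(
  source_address = "aa",
  sensor_name    = "s1",
  timestamp = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + c(0, 60, 120, 1000, 1060)
)

# Gap to the previous detection (NA for the first row)
toy[, time_diff := as.numeric(difftime(timestamp, shift(timestamp), units = "secs")),
    by = source_address]

# A new period starts on the first row or after a gap longer than the threshold
toy[, continuous_detection := cumsum(time_diff > gap_threshold | is.na(time_diff)),
    by = source_address]

# Sum the within-threshold gaps inside each period
toy[, total_continuous_time := sum(time_diff[time_diff <= gap_threshold], na.rm = TRUE),
    by = .(sensor_name, source_address, continuous_detection)]

print(toy)
```

The 880-second gap starts a second period, so the first period totals 120 seconds and the second 60 seconds. The long gap itself is never added to any total.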
Step 7: Remove Non-Mobile Devices
Identify and remove non-mobile devices based on the total continuous detection time.
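# Identify non-mobile devices
non_mobile_devices <- wf_filtered_1[total_continuous_time >= total_detection_time_threshold, .(source_address)]
non_mobile_devices <- unique(non_mobile_devices$source_address)

# Filter out non-mobile devices from the dataset.
# wf_filtered_1 was narrowed to the low-sensor candidates in Step 3, so we filter the
# non-randomized rows of wf_combined instead; otherwise devices seen by many sensors
# would be dropped as well. This matches the function in Section 8.3.
wf_filtered_3 <- wf_combined[source_address_randomized != 1 & !(source_address %in% non_mobile_devices)]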
This method flags a device as non-mobile only when it is both detected by few sensors and continuously present for a long time, which is more selective than using either criterion alone. Adjust sensor_threshold_percentile, continuous_detection_threshold, and total_detection_time_threshold as needed for your dataset and analysis requirements.
8.2.3 Verification of Filtering Steps
To check that the filtering steps have been applied correctly, we will summarize the number of unique devices and the number of detection records (rows) remaining after each step.
# Rows remaining after removing randomized MAC addresses.
# Recomputed from wf_combined because wf_filtered_1 was narrowed in Step 3.
wf_no_random <- wf_combined[source_address_randomized != 1]

# Create a summary table to verify filtering steps
filter_table <- data.table(
  step = c("Initial",
           "After Removal of Random MACs",
           "After Removal of Non-Mobile Devices (Final)"),
  # Number of unique devices after each step
  unique_MAC = c(uniqueN(wf_combined$source_address),
                 uniqueN(wf_no_random$source_address),
                 uniqueN(wf_filtered_3$source_address)),
  # Number of detection records (rows) after each step
  detection_time = c(nrow(wf_combined),
                     nrow(wf_no_random),
                     nrow(wf_filtered_3))
)

# Print the summary table to verify filtering steps
print(filter_table)
                                          step unique_MAC detection_time
                                        <char>      <int>          <int>
1:                                     Initial        243           2694
2:                After Removal of Random MACs        151            859
3: After Removal of Non-Mobile Devices (Final)        151            859
The verification table provides a summary of the filtering process, showing the impact of each filtering step:
Initial: The combined dataset contains 243 unique MAC addresses and 2,694 detection records.
After Removal of Random MACs: Removing random MAC addresses reduces the dataset to 151 unique MAC addresses and 859 detection records.
After Removal of Non-Mobile Devices (Final): The final step retains 151 unique MAC addresses and 859 detection records, indicating that no devices met the non-mobile criteria within this dataset’s timeframe.
No devices were removed in step 3, most likely because the example dataset covers only a short period. Non-mobile devices are typically identified over longer periods (e.g., 12 hours), and with total_detection_time_threshold set to 2 hours (7,200 seconds), this dataset may not contain enough data for any device to meet the criterion.
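One way to check this explanation is to compare the time span covered by the data with the total detection time threshold. The sketch below does this on made-up timestamps; with the real data, the same comparison could be run on the timestamp column of wf_combined, assuming it is parsed as a date-time.

```r
library(data.table)

total_detection_time_threshold <- 3600 * 2  # 2 hours, as defined in Step 1

# Made-up timestamps covering 45 minutes
toy <- data.table(
  timestamp = as.POSIXct("2024-01-01 09:00:00", tz = "UTC") + c(0, 600, 1800, 2700)
)

# Time span of the data in seconds
span_secs <- as.numeric(difftime(max(toy$timestamp), min(toy$timestamp), units = "secs"))
print(span_secs)

# If the span is shorter than the threshold, no device can reach it
can_reach_threshold <- span_secs >= total_detection_time_threshold
print(can_reach_threshold)
```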
8.3 Efficient Data Cleaning with a Function
To streamline the data cleaning process, we will create a function that performs all the filtering steps in one go. This function will:
Load the aggregated data from the specified CSV files.
Remove entries with random MAC addresses.
Identify and remove non-mobile devices based on the defined thresholds.
Return the cleaned dataset.
# Load necessary packages
if (!require(pacman)) install.packages("pacman")
pacman::p_load(data.table, lubridate)

# Define the data cleaning function
clean_wifi_data <- function(csv_files,
                            sensor_threshold_percentile,
                            continuous_detection_threshold,
                            total_detection_time_threshold) {
  # Load and combine the aggregated data
  wf_combined <- rbindlist(lapply(csv_files, fread))

  # Step 1: Remove entries with random MAC addresses
  wf_filtered_1 <- wf_combined[source_address_randomized != 1]

  # Step 2: Identify and remove non-mobile devices
  # Candidate devices: detected by few sensors (at or below the percentile)
  device_sensor_counts <- wf_filtered_1[, .(sensor_count = uniqueN(sensor_name)), by = source_address]
  low_detection_threshold <- quantile(device_sensor_counts$sensor_count, sensor_threshold_percentile)
  low_detection_devices <- device_sensor_counts[sensor_count <= low_detection_threshold, source_address]
  wf_filtered_2 <- wf_filtered_1[source_address %in% low_detection_devices]

  # Sort so that each lag refers to the previous detection of the same device
  setorder(wf_filtered_2, source_address, timestamp)

  # Gaps between detections, period IDs, and total continuous time per period
  wf_filtered_2[, time_diff := as.numeric(difftime(timestamp, shift(timestamp, type = "lag"), units = "secs")),
                by = .(source_address)]
  wf_filtered_2[, continuous_detection := cumsum(time_diff > continuous_detection_threshold | is.na(time_diff)),
                by = .(source_address)]
  wf_filtered_2[, total_continuous_time := sum(time_diff[time_diff <= continuous_detection_threshold], na.rm = TRUE),
                by = .(sensor_name, source_address, continuous_detection)]

  # Devices whose continuous time reaches the threshold are treated as non-mobile
  non_mobile_devices <- wf_filtered_2[total_continuous_time >= total_detection_time_threshold, .(source_address)]
  non_mobile_devices <- unique(non_mobile_devices$source_address)

  # Remove them from the full non-randomized dataset
  wf_final <- wf_filtered_1[!source_address %in% non_mobile_devices]
  return(wf_final)
}

# Define parameters
sensor_threshold_percentile <- 0.3          # Devices detected by fewer sensors: the 30th percentile
continuous_detection_threshold <- 300       # 5 minutes (300 seconds)
total_detection_time_threshold <- 3600 * 2  # Total 2 hours (7200 seconds)

# Example usage
csv_files <- list.files("material/ch3", pattern = "1second.*\\.csv$", full.names = TRUE)
cleaned_data <- clean_wifi_data(csv_files,
                                sensor_threshold_percentile,
                                continuous_detection_threshold,
                                total_detection_time_threshold)

# Print the first few rows of the cleaned data to verify
head(cleaned_data)
This function automates the entire data cleaning process, making it easier and faster to clean large datasets. Adjust the parameters as needed based on your specific dataset and analysis requirements.