8 Data Cleaning

In this section, we will clean the aggregated WiFi data so that it is suitable for analysis. We will load the aggregated data from CSV files, combine them into one table, and then filter out records with random MAC addresses and records from non-mobile devices.

8.1 Loading and Combining Aggregated Data

We will load the aggregated data from the previously created CSV files and combine them into a single dataset.

# Load necessary packages
if (!require(pacman)) install.packages("pacman")

Loading required package: pacman

pacman::p_load(data.table, lubridate)

# List all CSV files in the ch3 folder that contain "1second" in their filenames
csv_files <- list.files("material/ch3", pattern = "1second.*\\.csv$", full.names = TRUE)

# Load aggregated data from the CSV files and combine them
wf_combined <- rbindlist(lapply(csv_files, fread))

# Print the first few rows to verify
head(wf_combined, 5)

   sensor_name                                                   source_address
        <char>                                                           <char>
1:         A01 d94147cf12befe41bb40dd7957733c54442de7a9d45a75ec3c747856c4bdc129
2:         A01 5e69a0bc9bd73c0b72642e2e0f4f99670b85e8fdf4616bc19fb1f8d63107bfe5
3:         A01 05d29a432f4ff4c5f2e49e185334619d4365ef65370fcf9891bc7b1f8c0a68b6
4:         A01 a6a0a285818a48c083c72c885283f1652208b3239f70e859f49067b36781acc6
5:         A01 b3268f2d7ca90e7ea3ff549decbf484d478c3eaf28784a7bbfbd5aaee22d3a6a
   source_address_randomized           timestamp median_rssi count
                       <int>              <POSc>       <int> <int>
1:                         1 2024-04-09 19:17:27         -66     2
2:                         1 2024-04-09 19:17:27         -75     2
3:                         0 2024-04-09 19:17:27         -78     2
4:                         0 2024-04-09 19:17:27         -75     1
5:                         1 2024-04-09 19:17:27         -78     2
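When several files are combined, it can help to record which file each row came from. The following is a minimal sketch on two made-up files written to a temporary folder; the folder name, file names and values are assumptions for illustration and are not part of the course material.

```r
library(data.table)

# Two tiny stand-in files in a temporary folder (hypothetical data)
tmp_dir <- file.path(tempdir(), "wifi_demo")
dir.create(tmp_dir, showWarnings = FALSE)
fwrite(data.table(sensor_name = "A01", source_address = c("aa", "bb"),
                  source_address_randomized = c(1L, 0L), median_rssi = c(-66L, -78L)),
       file.path(tmp_dir, "A01_1second.csv"))
fwrite(data.table(sensor_name = "A02", source_address = "bb",
                  source_address_randomized = 0L, median_rssi = -70L),
       file.path(tmp_dir, "A02_1second.csv"))

# Same listing pattern as above; naming the vector lets rbindlist() label the rows
demo_files <- list.files(tmp_dir, pattern = "1second.*\\.csv$", full.names = TRUE)
names(demo_files) <- basename(demo_files)

# use.names = TRUE matches columns by name; idcol stores the source file of each row
demo_combined <- rbindlist(lapply(demo_files, fread), use.names = TRUE, idcol = "file")

# Number of rows contributed by each file
demo_combined[, .N, by = file]
```

The extra file column is optional; it is only a convenience for tracing unexpected rows back to their source.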

8.2 Filtering by Device Characteristics

8.2.1 Removal of Random MAC Addresses

Random MAC addresses can lead to duplicate counting and inconsistent data, because one device may appear under several different addresses. We will remove every record whose source_address_randomized flag is 1.

# Remove entries with random MAC addresses
wf_filtered_1 <- wf_combined[source_address_randomized != 1]
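Before dropping rows, it can be useful to see how many records and addresses carry each flag value. The sketch below uses a small made-up table (wf_demo and its values are assumptions). It also shows one behaviour worth knowing: in data.table, a row whose flag is NA is not kept by the `!= 1` condition.

```r
library(data.table)

# Hypothetical records: "aa" is randomized, "dd" has a missing flag
wf_demo <- data.table(
  source_address            = c("aa", "aa", "bb", "cc", "cc", "dd"),
  source_address_randomized = c(1L,   1L,   0L,   0L,   0L,   NA)
)

# Rows and unique addresses per flag value
wf_demo[, .(rows = .N, devices = uniqueN(source_address)), by = source_address_randomized]

# Same filter as in the text; rows with an NA flag are dropped as well
kept <- wf_demo[source_address_randomized != 1]
nrow(kept)
```

If records with a missing flag should be kept, the condition would need to handle NA explicitly.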

8.2.2 Removal of Non-Mobile Devices

We will filter out non-mobile devices from our dataset. Non-mobile devices typically exhibit characteristics such as being detected by fewer sensors and being present in the same location for extended periods. We will use two main criteria to identify and remove these non-mobile devices:

Detected by fewer sensors: Devices detected by a small number of sensors are likely to be non-mobile.
Detected for long durations: Devices detected for long periods in the same location are also likely to be non-mobile.

The following steps outline the process of identifying and removing non-mobile devices using these criteria.

Step 1: Define Filtering Thresholds

Define the thresholds for filtering. You can adjust these values based on your dataset and analysis needs.

# Define thresholds for filtering non-mobile devices
sensor_threshold_percentile <- 0.3  # Devices detected by fewer sensors: the 30th percentile
continuous_detection_threshold <- 300  # 5 minutes (300 seconds)
total_detection_time_threshold <- 3600 * 2  # Total 2 hours (7200 seconds)

Step 2: Calculate the Number of Unique Sensors

Calculate the number of unique sensors that detected each device.

# Calculate the number of unique sensors detecting each device
device_sensor_counts <- wf_filtered_1[, .(sensor_count = uniqueN(sensor_name)), by = source_address]

Step 3: Identify Devices Detected by Fewer Sensors

Identify devices that are detected by a small number of sensors, below the defined threshold.

# Identify devices detected by fewer sensors (below the specified percentile)
low_detection_threshold <- quantile(device_sensor_counts$sensor_count, sensor_threshold_percentile)
low_detection_devices <- device_sensor_counts[sensor_count <= low_detection_threshold, source_address]

# Keep only these candidate devices for the duration checks in Steps 4-7
# (note: this overwrites wf_filtered_1, which now holds the candidates only)
wf_filtered_1 <- wf_filtered_1[source_address %in% low_detection_devices]
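To see how the percentile threshold behaves, here is a small sketch with made-up sensor counts (demo_counts is an assumption). Because many devices share the same count and the comparison uses `<=`, the candidate group can be larger than 30% of the devices.

```r
library(data.table)

# Hypothetical number of sensors that detected each of ten devices
demo_counts <- data.table(
  source_address = paste0("dev", 1:10),
  sensor_count   = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 5)
)

# 30th percentile of the sensor counts (default quantile type)
demo_threshold <- quantile(demo_counts$sensor_count, 0.3)

# Devices at or below the threshold become non-mobile candidates
demo_candidates <- demo_counts[sensor_count <= demo_threshold, source_address]

demo_threshold
length(demo_candidates) / nrow(demo_counts)  # share of devices selected
```

In this toy case the threshold equals the minimum count, so every device seen by a single sensor is selected.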

Step 4: Calculate Time Differences Between Detections

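Sort the records of each device by time, then calculate the time difference between consecutive detections for each device.

# Sort by device and time so that the lag refers to the previous detection in time
setorder(wf_filtered_1, source_address, timestamp)

# Calculate the time difference between consecutive detections
wf_filtered_1[, time_diff := as.numeric(difftime(timestamp, shift(timestamp, type = "lag"), units = "secs")), by = .(source_address)]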
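The lag calculation can be checked on a few made-up rows. In this sketch (the table and its timestamps are assumptions), the rows are deliberately out of order, so they are sorted by device and time before the differences are taken.

```r
library(data.table)

# Hypothetical detections for two devices, not in chronological order
demo <- data.table(
  source_address = c("aa", "bb", "aa", "aa", "bb"),
  timestamp = as.POSIXct(c("2024-04-09 19:30:00", "2024-04-09 19:17:29",
                           "2024-04-09 19:17:27", "2024-04-09 19:17:30",
                           "2024-04-09 19:17:28"), tz = "UTC")
)

# Sort so that shift() returns the previous detection of the same device
setorder(demo, source_address, timestamp)

# Seconds since the previous detection; NA for the first detection of a device
demo[, time_diff := as.numeric(difftime(timestamp, shift(timestamp, type = "lag"),
                                        units = "secs")),
     by = source_address]
demo
```

Device "aa" gets differences of NA, 3 and 750 seconds; device "bb" gets NA and 1 second.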

Step 5: Identify Continuous Detection Periods

Identify periods of continuous detection based on the defined threshold.

# Identify continuous detection periods
wf_filtered_1[, continuous_detection := cumsum(time_diff > continuous_detection_threshold | is.na(time_diff)), by = .(source_address)]
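The cumulative sum works as a period counter: every gap longer than the threshold, and the first record of a device (where time_diff is NA), starts a new period. A minimal sketch on made-up gaps (demo and gap are assumptions):

```r
library(data.table)

gap <- 300  # same role as continuous_detection_threshold, in seconds

# Hypothetical gaps between consecutive detections of one device
demo <- data.table(source_address = "aa",
                   time_diff = c(NA, 3, 2, 750, 4, 301, 1))

# TRUE marks the start of a new period; cumsum() turns the marks into period IDs
demo[, continuous_detection := cumsum(time_diff > gap | is.na(time_diff)),
     by = source_address]
demo$continuous_detection
# 1 1 1 2 2 3 3
```

The gaps of 750 and 301 seconds each open a new period, so this device has three periods.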

Step 6: Calculate Total Continuous Detection Time

Calculate the total continuous detection time for each period.

# Calculate total continuous detection time for each period
wf_filtered_1[, total_continuous_time := sum(time_diff[time_diff <= continuous_detection_threshold], na.rm = TRUE), by = .(sensor_name, source_address, continuous_detection)]
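The sum within each period can be traced on the same kind of toy data. In this sketch (all values are assumptions), gaps above the threshold and the NA of the first record are excluded from the sum, so each period's total covers only the short gaps inside it.

```r
library(data.table)

gap <- 300  # same role as continuous_detection_threshold, in seconds

# Hypothetical device with three detection periods at one sensor
demo <- data.table(
  sensor_name          = "A01",
  source_address       = "aa",
  time_diff            = c(NA, 3, 2, 750, 4, 301, 1),
  continuous_detection = c(1, 1, 1, 2, 2, 3, 3)
)

# Add up the short gaps inside each period; long gaps and NA are left out
demo[, total_continuous_time := sum(time_diff[time_diff <= gap], na.rm = TRUE),
     by = .(sensor_name, source_address, continuous_detection)]
demo$total_continuous_time
# 5 5 5 4 4 1 1

# A device is flagged only if some period reaches the total-time threshold
any(demo$total_continuous_time >= 3600 * 2)
# FALSE
```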

Step 7: Remove Non-Mobile Devices

Identify and remove non-mobile devices based on the total continuous detection time.

# Identify and remove non-mobile devices
non_mobile_devices <- wf_filtered_1[total_continuous_time >= total_detection_time_threshold, .(source_address)]
non_mobile_devices <- unique(non_mobile_devices$source_address)

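# Filter out non-mobile devices from the dataset.
# wf_filtered_1 was narrowed to candidate devices in Step 3, so we start again from
# all non-random records, as the clean_wifi_data() function in Section 8.3 does
wf_filtered_3 <- wf_combined[(source_address_randomized != 1) & !(source_address %in% non_mobile_devices)]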

This method flags a device as non-mobile only when it is both detected by few sensors and detected continuously for a long time. Adjust the sensor_threshold_percentile, continuous_detection_threshold and total_detection_time_threshold parameters as needed for your dataset and analysis requirements.

8.2.3 Verification of Filtering Steps

To check that the filtering steps have been applied correctly, we will summarize the number of unique devices and the number of detection records (rows) at each step.

# Create a summary table to verify filtering steps
# wf_filtered_1 was overwritten in Step 3, so the non-random subset is recomputed here
filter_table <- data.table(
  step = c("Initial", "After Removal of Random MACs", "After Removal of Non-Mobile Devices (Final)"),
  # Number of unique devices after each step
  unique_MAC = c(uniqueN(wf_combined$source_address),
                 uniqueN(wf_combined[source_address_randomized != 1, source_address]),
                 uniqueN(wf_filtered_3$source_address)),
  # Number of detection records (rows) after each step
  detection_time = c(nrow(wf_combined),
                     nrow(wf_combined[source_address_randomized != 1]),
                     nrow(wf_filtered_3))
)

# Print the summary table to verify filtering steps
print(filter_table)

                                          step unique_MAC detection_time
                                        <char>      <int>          <int>
1:                                     Initial        243           2694
2:                After Removal of Random MACs        151            859
3: After Removal of Non-Mobile Devices (Final)        151            859

The verification table provides a summary of the filtering process, showing the impact of each filtering step:

Initial: The combined dataset contains 243 unique MAC addresses and 2,694 detection records.
After Removal of Random MACs: Removing random MAC addresses reduces the dataset to 151 unique MAC addresses and 859 detection records.
After Removal of Non-Mobile Devices (Final): The final step retains 151 unique MAC addresses and 859 detection records, indicating that no devices met the non-mobile criteria within this dataset's timeframe.

No devices are removed in step 3, most likely because the example dataset covers only a short period. A device is flagged only when its continuous detection time reaches total_detection_time_threshold (2 hours here). Non-mobile devices are typically identified over longer periods (e.g., 12 hours), and this dataset may not provide enough data to meet that criterion.
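One way to check this explanation is to compare the time span covered by the data with the total-time threshold. The sketch below uses two made-up timestamps in place of the real timestamp column; the values are assumptions.

```r
library(data.table)

total_detection_time_threshold <- 3600 * 2  # 2 hours, as defined in Step 1

# Hypothetical first and last detection times of a short recording
demo <- data.table(
  timestamp = as.POSIXct(c("2024-04-09 19:17:27", "2024-04-09 19:47:27"), tz = "UTC")
)

# Seconds between the earliest and the latest record
span_secs <- as.numeric(difftime(max(demo$timestamp), min(demo$timestamp), units = "secs"))

# If the span is shorter than the threshold, no device can be flagged as non-mobile
span_secs >= total_detection_time_threshold
# FALSE
```

Applied to the real data, the same comparison would use wf_combined$timestamp.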

8.3 Efficient Data Cleaning with a Function

To streamline the data cleaning process, we will create a function that performs all the filtering steps in one go. This function will:

Load the aggregated data from the specified CSV files.
Remove entries with random MAC addresses.
Identify and remove non-mobile devices based on the defined thresholds.
Return the cleaned dataset.

# Load necessary packages
if (!require(pacman)) install.packages("pacman")
pacman::p_load(data.table, lubridate)

# Define the data cleaning function
clean_wifi_data <- function(csv_files, sensor_threshold_percentile, continuous_detection_threshold, total_detection_time_threshold) {
  # Load and combine the aggregated data
  wf_combined <- rbindlist(lapply(csv_files, fread))
  
  # Step 1: Remove entries with random MAC addresses
  wf_filtered_1 <- wf_combined[source_address_randomized != 1]
  
  # Step 2: Identify and remove non-mobile devices
  device_sensor_counts <- wf_filtered_1[, .(sensor_count = uniqueN(sensor_name)), by = source_address]
  low_detection_threshold <- quantile(device_sensor_counts$sensor_count, sensor_threshold_percentile)
  low_detection_devices <- device_sensor_counts[sensor_count <= low_detection_threshold, source_address]
  
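  wf_filtered_2 <- wf_filtered_1[source_address %in% low_detection_devices]
  # Sort by device and time so that the lag refers to the previous detection in time
  setorder(wf_filtered_2, source_address, timestamp)
  wf_filtered_2[, time_diff := as.numeric(difftime(timestamp, shift(timestamp, type = "lag"), units = "secs")), by = .(source_address)]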
  wf_filtered_2[, continuous_detection := cumsum(time_diff > continuous_detection_threshold | is.na(time_diff)), by = .(source_address)]
  wf_filtered_2[, total_continuous_time := sum(time_diff[time_diff <= continuous_detection_threshold], na.rm = TRUE), by = .(sensor_name, source_address, continuous_detection)]
  non_mobile_devices <- wf_filtered_2[total_continuous_time >= total_detection_time_threshold, .(source_address)]
  non_mobile_devices <- unique(non_mobile_devices$source_address)
  
  wf_final <- wf_filtered_1[!source_address %in% non_mobile_devices]
  
  return(wf_final)
}

# Define parameters
sensor_threshold_percentile <- 0.3  # Devices detected by fewer sensors: the 30th percentile
continuous_detection_threshold <- 300  # 5 minutes (300 seconds)
total_detection_time_threshold <- 3600 * 2  # Total 2 hours (7200 seconds)

# Example usage
csv_files <- list.files("material/ch3", pattern = "1second.*\\.csv$", full.names = TRUE)
cleaned_data <- clean_wifi_data(csv_files, sensor_threshold_percentile, continuous_detection_threshold, total_detection_time_threshold)

# Print the first few rows of the cleaned data to verify
head(cleaned_data)

   sensor_name                                                   source_address
        <char>                                                           <char>
1:         A01 05d29a432f4ff4c5f2e49e185334619d4365ef65370fcf9891bc7b1f8c0a68b6
2:         A01 a6a0a285818a48c083c72c885283f1652208b3239f70e859f49067b36781acc6
3:         A01 f6e4a5fce8432422779b9e68da551a19b24b749ddbd58735bd95334747258d66
4:         A01 c419f59d5917e15bd08b3ac2bd37467e8ecc076d44fc138856f3c9178718e118
5:         A01 14bb59b4ee06a8c1c27d72811d39d856ea9c98003ed1f563f9d3680cc1d9b63a
6:         A01 fa37ab8e615eb19b7a2cdbcea288c51271da5f418e74d892597e9b0b6226f118
   source_address_randomized           timestamp median_rssi count
                       <int>              <POSc>       <int> <int>
1:                         0 2024-04-09 19:17:27         -78     2
2:                         0 2024-04-09 19:17:27         -75     1
3:                         0 2024-04-09 19:17:28         -76     2
4:                         0 2024-04-09 19:17:28         -71     1
5:                         0 2024-04-09 19:17:29         -73     1
6:                         0 2024-04-09 19:17:29         -73     1

This function automates the entire data cleaning process, making it easier and faster to clean large datasets. Adjust the parameters as needed based on your specific dataset and analysis requirements.
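When the parameters are adjusted often, a small check before calling the function can catch mistakes early. The helper below, check_cleaning_inputs(), is a hypothetical addition and not part of the function above; it only sketches the kind of checks that might be useful.

```r
# Hypothetical helper: stops with a message if an input looks wrong
check_cleaning_inputs <- function(csv_files, sensor_threshold_percentile,
                                  continuous_detection_threshold,
                                  total_detection_time_threshold) {
  if (length(csv_files) == 0) stop("No CSV files were supplied.")
  if (!all(file.exists(csv_files))) stop("Some CSV files do not exist.")
  if (sensor_threshold_percentile < 0 || sensor_threshold_percentile > 1) {
    stop("sensor_threshold_percentile must be between 0 and 1.")
  }
  if (continuous_detection_threshold <= 0 || total_detection_time_threshold <= 0) {
    stop("Time thresholds must be positive numbers of seconds.")
  }
  invisible(TRUE)
}

# Example with an empty temporary file standing in for a real CSV file
demo_file <- tempfile(fileext = ".csv")
file.create(demo_file)
check_cleaning_inputs(demo_file, 0.3, 300, 3600 * 2)
```

If all checks pass, the helper returns TRUE invisibly and clean_wifi_data() can be called with the same arguments.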