This kind of data conversion is common in machine learning data preparation. IoT source data are commonly multivariate time series, stored for example in a Pandas data frame. However, the ML model requires a NumPy array with a time dimension in each time window.
Some Python libraries provide a rolling window mechanism, for example tsfresh (https://tsfresh.readthedocs.io/en/latest/text/forecasting.html).
Or you could write a loop over the Pandas data frame, but that is very slow if your dataset is large and contains data from multiple stations.
So I decided to explore some built-in functions in NumPy and Pandas to achieve a more efficient data conversion.
1. First, I create a rank column based on the time data.
2. I slide time windows across the different stations using the same ranks, then concatenate the results.
3. Now we have all the sliding windows stacked together. Let's split them to make the NumPy array.
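Steps 1 to 3 could be sketched roughly as follows. This is only an illustration under assumptions: the toy data, the column names (`station`, `time`, `temp`, `humidity`, `rank`, `rank_group`), the use of a dense rank over the timestamps, and the loop over window start ranks are all made up for the example and may differ from the actual implementation. It also assumes every station has a reading at every timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 stations, 5 daily readings each, 2 features.
times = pd.date_range("2021-01-01", periods=5, freq="D")
df = pd.DataFrame({
    "station": ["A"] * 5 + ["B"] * 5,
    "time": list(times) * 2,
    "temp": np.arange(10, dtype=float),
    "humidity": np.arange(10, 20, dtype=float),
})
window_size = 3
feature_cols = ["temp", "humidity"]

# Step 1: identical timestamps get the same rank across stations (1, 2, 3, ...).
df["rank"] = df["time"].rank(method="dense").astype(int)

# Keep each station's rows contiguous and in time order inside every slice.
df = df.sort_values(["station", "rank"])

# Step 2: one slice per window start; each slice holds that window for all stations.
max_rank = int(df["rank"].max())
slices = []
for start in range(1, max_rank - window_size + 2):
    part = df[df["rank"].between(start, start + window_size - 1)].copy()
    part["rank_group"] = start  # remembers which window the rows belong to
    slices.append(part)
stacked = pd.concat(slices, ignore_index=True)

# Step 3: keep only the feature columns as a 2-D NumPy array.
features = stacked[feature_cols].to_numpy()
print(features.shape)
```

The loop runs once per window start rather than once per row, so the per-row work stays inside Pandas.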
The Features Array now has the shape of:
( Time Window Size * Time Window Count, Feature Count)
We already know the Time Window Size and Feature Count, so we can reshape the NumPy array accordingly:
np.reshape( x , (-1 , Time Window Size, Feature Count) )
Here, '-1' means we leave the total Time Window Count unspecified; NumPy calculates it automatically from the array size and the other two dimensions.
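As a small, self-contained illustration of this reshape, here is a toy array with made-up sizes. The variable names and numbers are assumptions for the example only.

```python
import numpy as np

window_size = 3
feature_count = 2
window_count = 4  # only used to build the toy input

# Toy stacked array: windows laid end to end, one row per time step.
x = np.arange(window_count * window_size * feature_count, dtype=float)
x = x.reshape(window_count * window_size, feature_count)

# -1 lets NumPy infer the number of windows from the remaining dimensions.
windows = np.reshape(x, (-1, window_size, feature_count))
print(windows.shape)  # (4, 3, 2)
```

Each `windows[i]` is then the block of `window_size` consecutive rows of `x` that made up window `i`, which is why the rows must already be stacked window by window before reshaping.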
4. Now we have the array of time windows. We can use np.transpose to reorder the dimensions.
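For instance, if a model expects the feature dimension before the time dimension, the last two axes can be swapped. The target layout here is an assumption chosen for illustration; the right axis order depends on the model.

```python
import numpy as np

# Toy input with shape (window count, window size, feature count).
windows = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)

# Reorder to (window count, feature count, window size).
features_first = np.transpose(windows, (0, 2, 1))
print(features_first.shape)  # (4, 2, 3)
```

Note that `np.transpose` only changes how the axes are indexed; the values themselves are not modified.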
5. Normally, we would like to keep the information about each time window. This can be done by grouping the Pandas data frame.
First, we need to add a column to the [Step 2] data frame.
Then we group by Station and Rank Group, keeping the original order, and create a new data frame from the grouped result. It corresponds to each Time Window in the correct order.
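A minimal sketch of this grouping, again under assumptions: the toy data, the column names (`station`, `time`, `rank`, `rank_group`) and the choice to keep each window's start and end time are illustrative, not necessarily what the original code records.

```python
import pandas as pd

# Hypothetical toy data: 2 stations, 4 daily readings each.
times = pd.date_range("2021-01-01", periods=4, freq="D")
df = pd.DataFrame({"station": ["A"] * 4 + ["B"] * 4, "time": list(times) * 2})
df["rank"] = df["time"].rank(method="dense").astype(int)

# Rebuild the stacked windows, tagging each slice with its window start rank.
window_size = 3
parts = []
for start in range(1, int(df["rank"].max()) - window_size + 2):
    part = df[df["rank"].between(start, start + window_size - 1)].copy()
    part["rank_group"] = start
    parts.append(part)
stacked = pd.concat(parts, ignore_index=True)

# sort=False keeps groups in order of first appearance, which matches
# the order of the windows in the stacked data frame.
window_info = (
    stacked.groupby(["station", "rank_group"], sort=False)
    .agg(start_time=("time", "min"), end_time=("time", "max"))
    .reset_index()
)
print(window_info)
```

Row `i` of `window_info` then describes window `i` of the reshaped array, provided both were built from the same stacked data frame.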
Notes:
- This does not handle missing data imputation, because that is very case dependent.
- This article discusses a similar problem, too:
https://towardsdatascience.com/fast-and-robust-sliding-window-vectorization-with-numpy-3ad950ed62f5