Understand the data
1. Understand the data
Welcome back.
2. Store data to disk
After acquiring the data, we should store it to disk for future analysis. This is common practice, partly because not all IoT data sources give us unlimited historical data, and partly because we want reproducible results for our analysis. One prerequisite for reproducibility is having the same data available for multiple runs. If we want to train a machine learning model, we also want to keep as much historical data as possible to achieve better results, so having access to historical data is key.
3. Store data using pandas
Storing data as JSON with pandas is as simple as calling DataFrame.to_json() with the filename as the first argument. We specify orient="records" to achieve a human-readable result. As we can see, the format of the stored data is identical to that of the downloaded data. There are many other storage formats available, such as CSV or HDF5, each with different benefits and drawbacks, but those formats are beyond the scope of this course.
4. Reading stored data
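A minimal sketch of the DataFrame.to_json() call with orient="records" might look like the following. The sample readings, the column names, and the file name environ_data.json are assumptions made for illustration; they are not the course's actual dataset.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample readings; column names and values are illustrative only.
df_env = pd.DataFrame(
    {
        "timestamp": ["2018-09-01 00:00:00", "2018-09-01 00:10:00"],
        "humidity": [95.5, 94.5],
        "temperature": [16.25, 16.5],
    }
)

# Write into a temporary directory so the sketch leaves no files behind.
out_path = Path(tempfile.mkdtemp()) / "environ_data.json"

# orient="records" writes a JSON array with one object per row.
df_env.to_json(out_path, orient="records")

# Read the raw file back with the standard library to inspect its layout.
records = json.loads(out_path.read_text())
print(records[0])
```

Each row becomes one JSON object keyed by column name, which is why this layout stays easy to read by eye.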
After collecting the data and storing some history to disk, we need to load it again. pandas provides convenient methods for loading data, depending on the storage format used. Common formats include CSV and JSON. If the data is stored as JSON, we use pd.read_json() to load it. Similarly, when the data is saved as a CSV file, we use pd.read_csv().
5. Validate data load
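As a rough sketch of loading with pd.read_json() and pd.read_csv(), the example below first writes a small, made-up DataFrame in both formats and then reads each file back. The file names and columns are hypothetical, and parse_dates is an optional extra used here so the CSV timestamp column comes back as a datetime.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, written to disk first so there is something to load.
df_original = pd.DataFrame(
    {
        "timestamp": ["2018-09-01 00:00:00", "2018-09-01 00:10:00"],
        "humidity": [95.5, 94.5],
        "temperature": [16.25, 16.5],
    }
)
tmp_dir = Path(tempfile.mkdtemp())
json_path = tmp_dir / "environ_data.json"
csv_path = tmp_dir / "environ_data.csv"
df_original.to_json(json_path, orient="records")
df_original.to_csv(csv_path, index=False)

# Load the JSON file; orient must match the layout used when writing.
df_from_json = pd.read_json(json_path, orient="records")

# Load the CSV file and ask pandas to parse the timestamp column as datetime.
df_from_csv = pd.read_csv(csv_path, parse_dates=["timestamp"])

print(df_from_json.shape, df_from_csv.shape)
```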
After loading the data, we should take a quick look at it and check that it was loaded correctly. The simplest way to do this is to call df_env.head(), which shows the first 5 rows by default.
6. DataFrame.info()
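A quick check with head() could be sketched as follows. The DataFrame is built inline with invented values so the snippet runs on its own; in practice df_env would be the data loaded from disk.

```python
import pandas as pd

# Hypothetical readings at 10-minute intervals; values are illustrative only.
df_env = pd.DataFrame(
    {
        "timestamp": pd.date_range(
            "2018-09-01", periods=8, freq=pd.Timedelta(minutes=10)
        ),
        "temperature": [16.0, 16.5, 16.25, 16.0, 15.75, 15.5, 15.5, 15.25],
    }
)

# head() returns the first 5 rows unless told otherwise.
first_rows = df_env.head()
print(first_rows)

# Passing a number changes how many rows are returned.
first_three = df_env.head(3)
print(first_three)
```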
We can get an overview of the loaded data by using DataFrame.info(), which provides a quick summary of the DataFrame. We can see the number of columns, 5 in this case, the column names, the number of non-null values in each column, and the datatype of each column. This allows us to quickly identify whether loading the data worked as expected, or whether some datatypes are unexpected and need fixing. DataFrame.info() also provides a one-line summary of the datatypes in the DataFrame, counting the occurrences of each datatype. Additionally, we can see the memory usage of the DataFrame.
7. pandas describe()
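The sketch below shows one way DataFrame.info() can reveal an unexpected datatype and a missing value. The data is invented for illustration: the timestamp column is deliberately stored as text, and one temperature reading is missing. Capturing the output in a buffer is optional; calling df_env.info() directly prints the same summary.

```python
import io

import pandas as pd

# Hypothetical data: timestamps stored as text, one missing temperature value.
df_env = pd.DataFrame(
    {
        "timestamp": [
            "2018-09-01 00:00:00",
            "2018-09-01 00:10:00",
            "2018-09-01 00:20:00",
        ],
        "temperature": [16.0, None, 15.5],
    }
)

# info() prints by default; a buffer lets us keep the text for inspection.
buffer = io.StringIO()
df_env.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# The summary shows timestamp is not a datetime column, so convert it.
df_env["timestamp"] = pd.to_datetime(df_env["timestamp"])
print(df_env.dtypes)
```

The non-null count for temperature is lower than the number of rows, which is the kind of discrepancy this summary makes easy to spot.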
DataFrame.describe() is another method pandas provides to get a quick overview of the data. It automatically calculates multiple summary statistics for each numeric column. We can see that the timestamp column is missing because its type is datetime, and therefore not numeric (pandas 2.0 and later summarize datetime columns as well). More than 50% of the values in the sunshine column are 0. This is expected, since the sun does not shine at night and, during winter in the northern hemisphere, daylight is shorter than nighttime.
8. Time for Practice!
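To see how describe() can show that at least half of a column's values are 0, consider the sketch below. The readings are made up for illustration. If the 50% row (the median) of a non-negative column is 0, then at least half of its values must be 0.

```python
import pandas as pd

# Hypothetical winter readings every 4 hours; values are illustrative only.
df_env = pd.DataFrame(
    {
        "timestamp": pd.date_range(
            "2018-12-01", periods=6, freq=pd.Timedelta(hours=4)
        ),
        "sunshine": [0.0, 0.0, 0.0, 300.0, 120.0, 0.0],
        "temperature": [2.0, 1.5, 1.0, 4.0, 5.5, 3.0],
    }
)

# Summary statistics: count, mean, std, min, quartiles, and max per column.
summary = df_env.describe()
print(summary)

# The "50%" row is the median; 0 here means at least half the values are 0.
median_sunshine = summary.loc["50%", "sunshine"]

# Cross-check by computing the share of zero readings directly.
zero_share = (df_env["sunshine"] == 0).mean()
print(median_sunshine, zero_share)
```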
And now it's time for you to practice.