The importance of flat files in data science
1. The importance of flat files in data science
Now you know how to import plain text files,2. Flat files
we're going to look at flat files, such as 'titanic dot csv',3. Flat files
in which each4. Flat files
row is a unique passenger onboard and each5. Flat files
column is a feature of attribute, such as gender, cabin and 'survived or not'. It is essential for any budding data scientist to know precisely what the term flat file means.6. Flat files
Flat files are basic text files containing records, that is, table data, without structured relationships. This is in contrast to a relational database, for example, in which columns of distinct tables can be related. We'll get to these later. To be even more precise, flat files consist of records, where by a record we mean a row of fields or attributes, each of which contains at most one item of information. In the flat file 'titanic dot csv', each7. Flat files
row or record is a unique passenger onboard and each column is a feature or attribute, such as8. Flat files
name, gender and cabin.9. Header
It is also essential to note that a flat file can have a header, such as in 'titanic dot csv', which is a10. Header
row that occurs as the first row and describes the contents of the data columns or states what the corresponding attributes or features in each column are. It will be important to know whether or not your file has a header as it may alter your data import. The reason that flat files are so important in data science is that we data scientists really honestly like to think in records or rows of attributes.11. File extension
Now you may have noticed that the file extension was dot csv. You may be wondering what this is? Well, CSV is an acronym for comma separated value and it means exactly what it says. The values in each row are separated by commas. Another common extension for a flat file is dot txt, which means a text file. Values in flat files can be separated by characters or sequences of characters other than commas, such as a tab, and the character or characters in question is called a delimiter.12. Tab-delimited file
See here an example of a tab-delimited file. The data consists of the famous MNIST digit recognition images, where13. Tab-delimited file
each row contains the pixel values of a given image. Note that all fields in the MNIST data are numeric, while the 'titanic dot csv' also contained strings.14. How do you import flat files?
How do we import such files? If they consist entirely of numbers and we want to store them as a numpy array, we could use numpy. If, instead, we want to store the data in a dataframe, we could use pandas. Most of the time, you will use one of these options. In the rest of this Chapter, you'll learn how to import flat files that contain only numerical data, such as the MNIST data, and import flat files that contain both numerical data and strings, such as 'titanic dot csv'.15. Let's practice!
But first, lets get you to do a couple of quick multiple choice questions to test your knowledge of flat files.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.