I would say start by loading the data into a pandas DataFrame.
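As a minimal sketch of that first step: the CSV text below is made up purely for illustration (the column names 'party', 'education' and 'budget' and all the values are assumptions, not a real dataset), and it is read from an in-memory string so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content, invented for illustration only.
csv_text = """party,education,budget
democrat,1,1
republican,0,0
democrat,1,0
republican,1,0
"""

# With a real file you would pass its path to pd.read_csv
# instead of wrapping a string in a StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))  # confirms we now have a pandas DataFrame
```

Reading from a string here is only a convenience for the example; the resulting 'df' behaves the same as one loaded from disk.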
Once you get it in a DataFrame (generally you nickname it 'df'), start with:

1. 'df.head()'. This gives you the first 5 rows of data by default (pass a number, e.g. 'df.head(10)', to see more).
2. 'df.describe()'. Summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns.
3. 'df.info()'. Column names, data types and non-null counts, along with the number of rows and columns.

These 3 commands will start your journey towards Exploratory Data Analysis (EDA), which is a very important step before we get to apply all the cool machine learning algorithms. Beware, do not underestimate EDA. Proper EDA tells us what we can find in the data, which helps determine which machine learning model to use and what preprocessing the data needs. Preprocessing and model running can be an iterative task, and we can reduce the number of iterations and the time they consume by concentrating on EDA.

The above 3 commands will start us with numerical EDA. Along with it, some visual EDA is also very helpful. I know you are thinking about a histogram for a single variable and a scatter plot ('scatter_matrix()') to see the preliminary relationship between 2 variables that you think are most important for predicting the target. These are definitely the first steps. But when all the features are binary (0 or 1), a different approach might help: a scatter plot of 0/1 values just piles every point onto a few corners. Seaborn's countplot is a great plot that works in such cases.

Below is sample code which uses a dataset of Republicans and Democrats and the bills they voted on. It plots the response (Yes as 1, No as 0) of members for the 'education' bill, colored by party.

    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure()
    sns.countplot(x='education', hue='party', data=df, palette='RdBu')
    plt.xticks([0,1], ['No', 'Yes'])
    plt.show()

Hope it helps.. :)
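P.S. To make the numerical side concrete, here is a rough sketch on a tiny made-up table. The column names 'party' and 'education' follow the plot example above; 'budget' and all of the values are invented for illustration and do not come from the real voting dataset. The last step uses 'pd.crosstab' (my addition, not part of the original recipe) to print the same counts that a countplot would draw as bars.

```python
import pandas as pd

# Hypothetical toy data: values are invented, not taken from the real dataset.
df = pd.DataFrame({
    "party": ["democrat", "republican", "democrat",
              "republican", "democrat", "republican"],
    "education": [1, 0, 1, 1, 0, 0],
    "budget": [1, 0, 1, 0, 1, 0],
})

print(df.head())      # first 5 rows by default
df.info()             # column names, dtypes, non-null counts (prints directly)
print(df.describe())  # summary statistics for the numeric columns
print(df.shape)       # (number of rows, number of columns)

# Counts behind a countplot of 'education' split by 'party':
# rows are the vote (0 = No, 1 = Yes), columns are the parties.
counts = pd.crosstab(df["education"], df["party"])
print(counts)
```

Looking at the table first is a cheap way to sanity-check what the plot should show before you draw it.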