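I would say start by putting the data into a pandas DataFrame.
Once you have it in a DataFrame (usually named 'df'), start with:
1. 'df.head()'. This gives you the first 5 rows of data (pass a number, e.g. 'df.head(10)', to see more).
2. 'df.describe()'. Summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns.
3. 'df.info()'. Column names, data types and non-null counts, along with the number of rows and columns.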
These 3 commands will start your journey towards Exploratory Data Analysis (EDA), which is a very important step before we get to apply all the cool machine learning algorithms. Beware, do not underestimate EDA. Proper EDA shows us what we can find in the data, which helps determine which machine learning model to use and what preprocessing the data needs.
Preprocessing and model running can be an iterative task, and we can reduce the number of iterations and the time they consume by concentrating on EDA.
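If you want to try the three commands before touching your real data, here is a minimal sketch. The tiny DataFrame is made up purely for illustration (the column names 'party', 'education' and 'age' are my assumptions, not a real dataset), and 'df.shape' is added only as a quick cross-check of the dimensions.

```python
import pandas as pd

# Toy data invented for illustration only; replace with your own DataFrame.
df = pd.DataFrame({
    "party": ["democrat", "republican", "democrat",
              "republican", "democrat", "republican"],
    "education": [1, 0, 1, 1, 0, 0],
    "age": [34, 51, 47, 62, 29, 55],
})

print(df.head())      # first 5 rows by default; df.head(10) would show 10
df.info()             # column names, dtypes, non-null counts, memory usage
print(df.describe())  # summary statistics for the numeric columns only
print(df.shape)       # (number of rows, number of columns)
```

Note that 'df.describe()' skips the text column 'party' by default; 'df.describe(include="all")' would include it.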
The above 3 commands get us started with numerical EDA. Along with it, some visual EDA is also very helpful.
I know you are thinking about a histogram for a single variable and a scatter plot ('scatter_matrix()') to see the preliminary relationship between the 2 variables that you think are most important for predicting the target. These are definitely the first steps. But when all the features are binary (0 or 1), a different approach might help. Seaborn's countplot is a great plot that works in such cases.
Below is sample code that uses a dataset of Republicans and Democrats and the bills they voted on. It plots the members' responses (Yes as 1, No as 0) for the 'education' bill, colored by party.
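import matplotlib.pyplot as plt
import seaborn as sns
sns.countplot(x='education', hue='party', data=df, palette='RdBu')  # one bar per vote value, split by party
plt.xticks([0,1], ['No', 'Yes'])  # relabel the 0/1 ticks as No/Yes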
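For features that are not binary, the histogram and 'scatter_matrix()' first steps mentioned earlier could look roughly like the sketch below. The data is randomly generated and the column names ('feature_a', 'feature_b', 'target') are hypothetical, so treat it as a starting point rather than a recipe.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random toy data for illustration only; column names are hypothetical.
rng = np.random.default_rng(0)
df_num = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
df_num["target"] = 2 * df_num["feature_a"] + rng.normal(scale=0.5, size=100)

# One histogram per numeric column (single-variable view).
hist_axes = df_num.hist(bins=20)

# Pairwise scatter plots, with histograms on the diagonal.
scatter_axes = scatter_matrix(df_num, figsize=(6, 6))

# plt.show()  # uncomment when running interactively
```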
Hope it helps.. :)
After a year of being a 'stay at home Mom' to a 2-year-old, I decided to change the name of my blog and be active at writing again. This year things did not go as I planned. So here I am, putting myself out there, only my thoughts...no data or statistics...yet.
This year of 2020 is a very special one. It changed the whole of humanity, making most of us more humble and aware of the insignificant role we hold in front of mother nature. A small virus from a bat in a corner of the world changed the way a whole species behaves all over the world.
Being in the United States made things more interesting, as it was a time of change here politically, culturally and thus economically. Things are hard for a lot of people, but there is hope.
The first step in looking for a new opportunity is to figure out where to look. The following Tableau story tries to answer this question for highly skilled foreign workers in the USA. The data comes from the publicly available work visa data on the Office of Foreign Labor Certification website.
Some insights derived from the data:
Who is paid highest?
- Techies are not the highest paid foreign workers, attorneys are.
- It is a really good time to be a data scientist, as they are few in number but very highly paid.
- Visa class does not affect the salary for most of the job groups.
- The maximum median paid wage for data scientists is on the rise! The minimum paid wage shows a lowering trend, but it is not significant enough to consider.
- Business Analysts and Data Analysts are at a higher risk of getting a lower median paid wage than before.
- Software engineers have the chance of higher pay and also the risk of lower pay.
Who pays the most?
Contrary to popular belief, California is not the highest paying state for analysts. Montana pays the highest to business analysts and Delaware pays the highest to data analysts.
Considering the cost of living, i.e. adjusting the median pay with 2015 price parity data for each state, it looks like California is not the best place to earn a high salary as a data scientist. The best place for a data scientist is Arizona.
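To make the cost-of-living adjustment concrete, here is a minimal sketch. It assumes the adjustment divides the median wage by the state's price parity index, with 100 as the national average; the story itself may have done it differently. The wages and index values are hypothetical placeholders, not the numbers behind the story.

```python
def adjust_for_price_parity(median_wage, price_parity):
    """Scale a nominal wage by a price parity index (national average = 100)."""
    return median_wage / (price_parity / 100.0)


# Hypothetical (wage, price parity) pairs for illustration only.
examples = {
    "State A (higher cost)": (120_000, 113.0),
    "State B (lower cost)": (105_000, 96.0),
}

adjusted = {
    state: round(adjust_for_price_parity(wage, parity))
    for state, (wage, parity) in examples.items()
}

for state, value in adjusted.items():
    print(state, value)
```

With these placeholder numbers, the lower-cost state comes out ahead after adjustment even though its nominal wage is lower, which is the kind of reversal described above.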
The following screenshots from the story show how you can filter for the best place and the best companies according to job group, experience in months and education level.
The professions considered here are attorney, professor, data scientist, data analyst, business analyst, software engineer, management consultant and teacher. This visualization is inspired by the 'Data Visualization with Tableau' course from Duke University on Coursera.
Take a look at the whole story here.
You have a report to send (or a presentation). The deadline is in 30 minutes.
While going through it for the final time, you realize that the state-wise data you have presented in a bar chart could be shown more intuitively on a map. But you are out of time. Also, you do not have access to the visualization software that you generally use on your office computer.
The best solution is to open mapchart.net.
Here is my small project with mapchart. It took just 15 minutes to create two maps with the available data.
Here I compared the different states with respect to cancer deaths per 100,000 population in 2016. I took the data from www.americashealthrankings.org
A few interesting things that can be seen from the comparison are:
- Female deaths are fewer than male deaths.
- Utah and Colorado have the fewest cancer deaths overall, while Kentucky, West Virginia and Mississippi have the most.
By comparing different data like smoking percentage, personal income, healthcare cost, insurance umbrella, poverty etc. between states, there may be a way to find a relation between the different factors influencing cancer in the states. This is an ongoing project.
Here are the steps to use mapchart.
Step 1: Get the data
Step 2: Open mapchart.net (Do not use Edge. The map gets created nicely but behaves a bit funny when downloaded as a picture.)
Step 3: Select USA -> States (you can also select a lot of other countries, or the world)
Step 4: Select the fill color (you can change the color with a click)
Step 5: Click on the states you want to fill with that color.
Step 6: Change the 'Fill Color' and click on other states.
Step 7: Fill in the 'Legend Title' with the title of your map.
Step 8: Fill in the 'Label' beside each 'Color'...or 'Remove' them.
Step 9: Download the image
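Before clicking through Steps 4 to 6, it helps to know which states belong to which color. One possible way to prepare that list is sketched below with pandas. The state names and rates are hypothetical placeholders (not the americashealthrankings.org figures), and the three equal-sized groups via 'pd.qcut' are my assumption. Pick whatever grouping suits your legend.

```python
import pandas as pd

# Hypothetical rates per 100,000 for illustration only; use your own data here.
rates = pd.Series({
    "State A": 148.0,
    "State B": 162.5,
    "State C": 171.0,
    "State D": 185.5,
    "State E": 196.0,
    "State F": 153.0,
})

# Split the states into three equal-sized groups, one per fill color.
bands = pd.qcut(rates, q=3, labels=["low", "medium", "high"])

# For each legend label, list the states to click with that color selected.
states_by_band = {
    band: sorted(bands[bands == band].index)
    for band in bands.cat.categories
}

for band, states in states_by_band.items():
    print(band, states)
```

Each key can then be typed in as a 'Label' in Step 8, and its list tells you which states to click for that 'Fill Color'.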
You can download my data here: