To get a quick start and reach a professional level in Machine Learning, we need basic and professional-level skills in a few related topics. In this post we will look at those topics, where to start, and where resources are available for study and practice.
 Understand your data first:
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.” —Arthur Conan Doyle.
Machine Learning is highly dependent on data, specifically historical data. We have to collect, clean, and preprocess the data (ML models need numeric input), then identify and discover the hidden patterns in it to teach the machine. There are basically two types of data: structured (organized) and unstructured (unorganized). The values themselves fall into two categories, defined as follows:
 Quantitative data: This data can be described using numbers, and basic mathematical procedures, including addition, are possible on the set.
Quantitative/numeric data can be broken down, one step further, into discrete and continuous quantities. These can be defined as follows: Discrete data: This describes data that is counted. It can only take on certain values. Examples of discrete quantitative data include a die roll, because it can only take on six values, and the number of customers in a café, because you can’t have a fractional number of people.
 Continuous data: This describes data that is measured. It exists on an infinite range of values.
 Qualitative data: This data cannot be described using numbers and basic mathematics. This data is generally thought of as being described using “natural” categories and language.
Qualitative/categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false. Another useful type of categorical data is ordinal data, in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
 There are also four levels of data to understand:
 The nominal level: Data at the nominal level is mostly categorical in nature. At the nominal level, we deal with data usually described using vocabulary (but sometimes with numbers), with no order, and little use of mathematics. In order to find the center of nominal data, we generally turn to the mode (the most common element) of the dataset.
 The ordinal level: At the ordinal level, we have data that can be described with numbers and also have a “natural” order, allowing us to put one in front of the other. At the ordinal level, the median is usually an appropriate way of defining the center of the data. The mean, however, would be impossible because division is not allowed at this level.
 The interval level: At this level, addition and subtraction are allowed. We can use the median and mode to describe interval data; however, usually the most accurate description of the center is the arithmetic mean, more commonly referred to as, simply, “the mean”.
 The ratio level: The ratio level proves to be the strongest of the four. Not only can we define order and difference, but the ratio level also allows us to multiply and divide. The arithmetic mean still holds meaning at this level, as does a new type of mean called the geometric mean. Data at the ratio level is usually nonnegative.
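As a quick illustration of these levels, Python's built-in statistics module can compute the appropriate center for each (a minimal sketch with made-up values):

```python
import statistics

# Nominal level: only the mode (most common element) makes sense
colors = ["red", "blue", "red", "green", "red"]
print(statistics.mode(colors))        # red

# Ordinal level: the median is appropriate (e.g. 1-5 ratings)
ratings = [1, 3, 3, 4, 5]
print(statistics.median(ratings))     # 3

# Interval level: the arithmetic mean is meaningful
temps_c = [20.0, 22.5, 25.0]
print(statistics.mean(temps_c))       # 22.5

# Ratio level: the geometric mean also holds meaning (nonnegative data)
growth = [1.1, 1.2, 1.05]
print(statistics.geometric_mean(growth))
```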
Understanding the data will help us collect, clean, and preprocess it as needed to build the ML model.
 Programming Language & Libraries:
For machine learning, Python and R are in the lead, but from a developer's perspective Python could be the choice for its huge community and rich libraries. We need a good understanding of the language basics like lists, list slicing, tuples, dictionaries, counters, sets, zip and argument unpacking, as well as data structures and algorithms. The following libraries are worth knowing from the start. NumPy is a well-known general-purpose array-processing package. An extensive collection of high-complexity mathematical functions makes NumPy powerful for processing large multidimensional arrays and matrices.
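A quick refresher on the language basics mentioned above, all from the standard library (the values are made up for illustration):

```python
from collections import Counter

# List slicing: every second element
nums = [0, 1, 2, 3, 4, 5]
evens = nums[::2]                 # [0, 2, 4]

# Tuples and dictionaries
point = (3, 4)
scores = {"alice": 90, "bob": 85}

# Counter: count occurrences of each element
words = ["ml", "data", "ml", "python"]
counts = Counter(words)           # 'ml' appears twice

# zip pairs up sequences; * unpacks arguments (here, to "unzip")
xs, ys = [1, 2, 3], [4, 5, 6]
pairs = list(zip(xs, ys))         # [(1, 4), (2, 5), (3, 6)]
unzipped = list(zip(*pairs))      # [(1, 2, 3), (4, 5, 6)]

# Sets: unique elements only
unique = set(words)               # {'ml', 'data', 'python'}
```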
 SciPy is a free and open-source library based on NumPy. It can be used to perform scientific and technical computing on large sets of data. Similar to NumPy, SciPy comes with embedded modules for array optimization and linear algebra. It’s considered a foundational Python library due to its critical role in scientific analysis and engineering.
 scikit-learn is a free Python library based on NumPy and SciPy that’s often considered a direct extension of SciPy. It was specifically designed for data modeling and developing machine learning algorithms, both supervised and unsupervised.
 Pandas is another Python library built on top of NumPy, responsible for preparing high-level data sets for machine learning and training. It relies on two types of data structures: one-dimensional (Series) and two-dimensional (DataFrame).
 Seaborn is another open-source Python library, one that is based on Matplotlib (which focuses on plotting and data visualization) but features Pandas’ data structures. Seaborn is often used in ML projects because it can generate plots of learning data.
 Matplotlib is a Python library focused on data visualization, primarily used for creating graphs, plots, histograms, and bar charts. It is compatible with data from SciPy, NumPy, and Pandas. If you have experience using other graphing tools, Matplotlib might be the most intuitive choice for you. There are many more libraries beyond these.
 Linear Algebra:
 Vectors Basics: Introduction to vectors, Vector arithmetic, Coordinate system
 Vector Projections and Basis: Dot product of vectors, Scalar and vector projection, Changing basis of vectors, Basis, linear independence, and span
 Matrices: Matrices introduction, Types of matrices, Types of matrix transformation, Composition or combination of matrix transformations
 Gaussian Elimination: Solving linear equations using Gaussian elimination, Gaussian elimination and finding the inverse matrix, Inverse and determinant
 Matrices from Orthogonality to Gram–Schmidt Process: Matrices changing basis, Transforming to the new basis, Orthogonal matrix, Gram–Schmidt process
 Eigenvalues and Eigenvectors: Calculating eigenvalues and eigenvectors
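In practice NumPy's numpy.linalg.eig handles this; as a small illustration of what it computes, here is the 2×2 case solved by hand from the characteristic polynomial (a minimal sketch, real-eigenvalue case only):

```python
import math

def eig2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic
    polynomial: lambda^2 - trace*lambda + det = 0 (real case only)."""
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; not handled in this sketch")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2

# A diagonal matrix's eigenvalues are its diagonal entries
print(eig2x2(2, 0, 0, 3))   # (3.0, 2.0)
```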
For linear algebra, see Khan Academy and the edX course Essential Math for Machine Learning: Python Edition.
 Statistics & Probability:
 Populations and Samples
 Mean, Median, Mode
 Variance, Range, Interquartile Range (IQR), Skewness
 Correlation, Correlation and Causation
 Dependence and Independence
 Conditional Probability
 Bayes’s Theorem and Random variables
 Distributions
 Continuous Distribution
 Normal Distribution
 Central Limit Theorem (CLT)
 Hypothesis Testing: Null Hypothesis(H0), Alternate Hypothesis (HA)
 Errors in Hypothesis Testing: Type 1 Error, Type 2 Error
 Statistical Significance and P-Values
 The chi-square goodness of fit test
 Confusion Matrix, Precision, Accuracy, Recall, F1 Score
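All of these metrics fall out of the four binary confusion-matrix counts. A small sketch with hypothetical counts:

```python
# Hypothetical binary confusion-matrix counts:
# true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction correct overall
precision = tp / (tp + fp)                   # of predicted positives, how many were right
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, round(recall, 3), round(f1, 3))
```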
Resources for learning: Statistics and Probability (Khan Academy), Introduction to Statistics (Coursera), Statistics for Data Science with Python (Coursera)
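As a small worked example of one topic above, Bayes's theorem, here is the classic disease-testing calculation (all probabilities are made up for illustration):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical disease-testing numbers:
p_disease = 0.01          # prior P(D)
p_pos_given_d = 0.99      # sensitivity P(+|D)
p_pos_given_not_d = 0.05  # false positive rate P(+|not D)

# Total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Posterior P(D|+): surprisingly low despite the accurate test
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(round(p_d_given_pos, 3))   # about 0.167
```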
 Database Knowledge: Some basics of SQL and NoSQL may help, starting with SQL queries like:
 Join (Inner, Outer, Full, Cross)
 Pivot and Unpivot
 Window Functions
 Aggregate Window Functions – SUM, MIN, MAX, AVG, COUNT
 Ranking Window Functions – RANK, DENSE_RANK, ROW_NUMBER, NTILE
 Value Window Functions – LAG and LEAD, FIRST_VALUE and LAST_VALUE
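Window functions like RANK can be tried directly from Python's standard-library sqlite3 module (this assumes SQLite 3.25+, which ships with recent Python builds; the table and values are hypothetical):

```python
import sqlite3

# In-memory demo table of hypothetical sales
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (rep TEXT, amount INT)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("alice", 300), ("bob", 150), ("carol", 300)],
)

# RANK() gives tied rows the same rank and skips the next one
rows = con.execute(
    """
    SELECT rep, amount,
           RANK() OVER (ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY rnk, rep
    """
).fetchall()
print(rows)  # [('alice', 300, 1), ('carol', 300, 1), ('bob', 150, 3)]
```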
 IDE:
 Spyder: Scientific Python Development Environment (Spyder) is a free and open-source Python IDE. It is lightweight and an excellent Python IDE for data science & ML.
 Jupyter Notebook: For its simplicity, this one became a great IDE among data enthusiasts, as it is the descendant of IPython. The best thing about Jupyter is that you can very easily switch between different versions of Python (or other languages) according to your preference.
 Visual Studio Code: VS Code is one of the most used Python IDEs among ML & DS professionals. It works on Windows, Mac, and Linux operating systems.
 Machine Learning: Supervised and unsupervised learning are the two main techniques of machine learning, but they are used in different scenarios and with different datasets.
Supervised learning is a machine learning method in which models are trained using labeled data. In supervised learning, models need to find the mapping function from the input variable (X) to the output variable (Y).
Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns in the data on its own. The Machine Learning steps are:
 Business Understanding: What you want to achieve by implementing it. This is the “business understanding”.
 Identify the business objective/problem
 Regression problems: We have data that needs to be mapped onto a predictor variable, so we need to learn a function that can do this mapping.
 Classification problems: Here, we have data that needs to be divided into predefined classes, based on some features of the data. We need an algorithm that can use previously classified data to learn how to put unknown data into the correct class. Examples: K-Nearest Neighbor for classification of a categorical outcome or prediction of a numerical outcome; the Naïve Bayes classifier, which can be applied to data with categorical predictors.
 Summarization problems: Suppose we have data that needs to be shortened or summarized in some way. This could be as simple as calculating basic statistics from data, or as complex as learning how to summarize text or finding a topic model for text.
 Dependency modeling problems: For these problems, we have data that might be connected in some way, and we need to develop an algorithm that can calculate the probability of connection or describe the structure of connected data.
 Change and deviation detection problems: In another case, we have data that has changed significantly or where some subset of the data deviates from normative values. To solve these problems, we need an algorithm that can detect these issues automatically.
 Assess the situation
 Determine the analytical goals
 Produce a project plan
 Data Understanding: Determining what kind of data can be collected to build a deployable model.
 Collect the data
 Describe the data
 Explore the data
 Verify the data quality
 Data Preprocessing:
 Importing/Selecting the Dataset
 Handling Missing Data
 Handling Categorical Data
 Splitting the Dataset into the Training set and Test set
 Feature Scaling
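The splitting and scaling steps above can be sketched in plain Python (a simplified illustration; in practice scikit-learn's train_test_split and scalers are used, and the function names here are our own):

```python
import random

def train_test_split(data, test_ratio=0.25, seed=0):
    """Shuffle and split rows into train and test sets (minimal sketch)."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def min_max_scale(values):
    """Scale a numeric feature to the [0, 1] range (feature scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = list(range(8))
train, test = train_test_split(data)
print(len(train), len(test))          # 6 2
print(min_max_scale([10, 20, 30]))    # [0.0, 0.5, 1.0]
```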

 Modeling: As per the business need, we have to choose an appropriate model from the various kinds available:
 Regression: Value estimation, Ex how much wills this customer use the service?
 Linear Regression: Linear regression is an approach that models the linear relationship between a scalar dependent variable, y, and an independent variable, X, which can have one or more components: y = Xβ + ε, where β is the vector of coefficients and ε is the error term.
 Simple linear regression: A simple linear regression has a single variable, and it can be described using the following formula: y= A + Bx

Multiple linear regression model: Multiple linear regression occurs when more than one independent variable is used to predict a dependent variable: y = a + b1x1 + b2x2 + … + bnxn

Polynomial Regression: A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation: y = a + b*x^2
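For the simple linear regression formula y = A + Bx above, the least-squares coefficients have a closed form; a minimal pure-Python sketch:

```python
def simple_linear_regression(xs, ys):
    """Closed-form least-squares fit of y = A + B*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # B = covariance(x, y) / variance(x); A = mean_y - B * mean_x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Perfectly linear data generated from y = 1 + 2x
a, b = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```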
 Non-Linear Regression:
 Support Vector Regression (SVR)
 Decision Tree Regression
 Random Forest Regression
 Classification: Will this customer purchase service S1 if given incentive I? Which service package (S1, S2, or none) will a customer likely purchase if given incentive I?
 Decision trees: Decision tree is a type of supervised learning algorithm (having a predefined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables.
 Random Forest Classification: Random Forest is a trademark term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (hence the name “Forest”). To classify a new object based on its attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
 Logistic regression: Logistic regression extends the idea of linear regression to the situation where the dependent variable Y is categorical. We can think of a categorical variable as dividing the observations into classes. Don’t get confused by its name! It is a classification algorithm, not a regression algorithm.

The naive Bayes classifier: It works on the Bayes theorem of probability to predict the class of an unknown data point. Naive Bayes predicts the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

SVM (Support Vector Machine): It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. For example, if we only had two features like height and hair length of an individual, we’d first plot these two variables in two-dimensional space, where each point has two coordinates (the training points closest to the separating hyperplane are known as support vectors).
 K-Nearest Neighbor: KNN can be used for both classification and regression predictive problems. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from.
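The voting idea can be sketched in a few lines of plain Python (a toy illustration, not a production KNN; scikit-learn's KNeighborsClassifier is the usual choice):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote of its k nearest training
    points; `train` is a list of ((x, y), label) pairs."""
    dists = sorted(
        (math.dist(point, query), label) for point, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Two tight toy clusters, labeled "A" and "B"
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (1.5, 1.5)))  # A
```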
 Clustering:
 k-means clustering: k-means clustering is an unsupervised learning technique that helps partition data of n observations into K buckets of similar observations.
 Hierarchical clustering: Hierarchical clustering is an unsupervised learning technique where a hierarchy of clusters is built out of the observations.
 Divisive hierarchical clustering: This is a topdown approach where observations start off in a single cluster and then they are split into two as they go down a hierarchy.
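The k-means loop, assign each point to its nearest centroid, then move each centroid to its cluster's mean, can be sketched in plain Python (a simplified illustration; it initializes centroids from the first k points, whereas real implementations use random or k-means++ initialization):

```python
import math

def kmeans(points, k, iters=10):
    """Minimal k-means sketch: alternate assignment and update steps."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: move each centroid to its cluster's mean
        centroids = [
            tuple(sum(vals) / len(vals) for vals in zip(*cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return sorted(centroids)

# Two well-separated toy clusters near (0, 0) and (10, 10)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(points, 2))  # centroids near (0.33, 0.33) and (10.33, 10.33)
```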
 Recommender Systems: Recommendations help monetize the user behavior data that businesses capture, allowing them to recommend content that users will like. Recommender systems are a way of suggesting similar items and ideas suited to a user’s specific way of thinking.
 Model Evaluation:
 Evaluate the results: is the model Overfitting (good performance on the training data, poor generalization to other data) or Underfitting (poor performance on the training data and poor generalization to other data)?
 Review the process
 Determine the next steps based on evaluation score
 Test & Deployment: After validation using various metrics, the model is ready for testing. Here the test set will be used. The more data we use, the more robust the model will be. More resources for study and practice: Machine Learning with Python (Coursera), Kaggle, edX.
Hope this will help you get a quick start and achieve success sooner.
Thanks…..