NumPy - Numerical Python

An open-source Python library that is fast and efficient at performing mathematical operations. Because its core is written in the C programming language, its execution time is very low. Since it facilitates operations on multi-dimensional arrays and matrices, it is commonly applied in science and engineering. Arbitrary data types can be defined with NumPy, and integration with a wide variety of databases is also supported.
numpy.ndarray()
numpy.zeros()
numpy.ones()
numpy.empty()
numpy.arange()
numpy.reshape()
numpy.linspace()

ndarray - NumPy Array

ndarray - Useful Attributes

ndarray.ndim - array's number of dimensions
ndarray.shape - tuple of integers that represents the size of the array in each dimension
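
For example, a minimal sketch inspecting these attributes on a small, arbitrary array:

import numpy as np
#a 2 x 3 array built from a nested Python list
sample = np.array([[1, 2, 3], [4, 5, 6]])
print(sample.ndim)
#Output: 2 - the array has two dimensions
print(sample.shape)
#Output: (2, 3) - two rows and three columns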

Data Science Terminology - Statistical learning

Statistical learning, also known as machine learning, is a field of study that focuses on developing algorithms and techniques to enable computers to learn from and make predictions or decisions based on data. In statistical learning, we use mathematical and statistical methods to analyze and interpret patterns and relationships in data, with the goal of creating models that can generalize well to new, unseen data.

There exist two main types of statistical learning approaches:

'Supervised Learning' in which the algorithm is trained on labeled data, where each example is associated with a target or outcome variable. The algorithm learns from the input-output pairs in the training data and aims to generalize its predictions to new, unseen data. Common tasks in supervised learning include classification (predicting categories or classes) and regression (predicting continuous values).

'Unsupervised Learning' in which the algorithm is trained on unlabeled data, where there are no predefined target variables. The goal of unsupervised learning is to uncover hidden patterns, structures, or relationships in the data. Common tasks in unsupervised learning include clustering (grouping similar data points together) and dimensionality reduction (reducing the number of features or variables while preserving important information).

Statistical learning techniques are used in a wide range of applications, including image and speech recognition, natural language processing, recommendation systems, financial modeling, and many others. The field continues to advance rapidly, driven by innovations in algorithms, computational power, and the availability of large datasets.

Data Science Terminology - Continuous vs Quantitative Output Values

In statistical learning or machine learning, continuous and quantitative output values refer to the type of data that the model aims to predict or estimate.

'Continuous Output Values' are those that can take on any real number within a certain range. These values represent measurements that are not restricted to specific discrete points. For example, predicting the price of a house, the temperature, or the stock price are all examples of tasks where the output values are continuous.

'Quantitative Output Values' are similar to continuous values in that they are numeric and represent some quantity. However, they are typically more specific in nature and often represent counts or measurements of discrete objects or events. For example, predicting the number of items sold, the age of a person, or the number of defects in a manufacturing process are tasks where the output values are quantitative.

The main difference between continuous and quantitative output values lies in the nature of the data and the granularity of the predictions. Continuous output values can take on any value within a range, while quantitative output values are often counts or measurements of discrete entities or events.

However, the terms are often used interchangeably because quantitative data often involves continuous measurements.

Let's take a real life example:

Suppose we are predicting house prices based on various features like area, number of bedrooms and location. The predicted price would be a continuous output value because it can take any value within a certain range, such as $100,000 to $1,000,000. It is also a quantitative output value because it represents the amount of money, which is a quantity.

Continuous Output Value Examples:

Temperature readings can be measured with precision, such as 20.5°C or 68.3°F. The temperature can vary continuously, and there's no limit to the number of possible values.

Time is another continuous output value, where it can be measured down to fractions of a second. For example, 10:30:25.123 AM represents a precise time.

Quantitative Output Value Examples:

Number of Sales. In business, the number of sales made by a company within a certain period is a quantitative value. It represents the quantity of products or services sold.

The population of a city, country, or region is also quantitative. It represents the number of people living in that area and can be measured in millions or billions.

Inventory Levels. For a retail store, the quantity of items in inventory indicates how many units of each product are available for sale.

Test scores, such as scores on a math exam or a standardized test, represent a quantitative value of performance or proficiency level of individuals in a particular subject.

Data Science Terminology - Regression Problem

Regression is a type of predictive modeling technique used in statistical learning, particularly in supervised machine learning. It's used when the target variable, or the variable we want to predict, is continuous. In other words, regression helps us understand the relationship between one or more independent variables (also called predictors or features) and a continuous outcome variable.

The goal of regression analysis is to find the best-fitting mathematical model that describes the relationship between the independent variables and the dependent variable. This model can then be used to make predictions about the dependent variable for new data points where the independent variables are known.

For example, let's say we want to predict house prices based on factors like square footage, number of bedrooms, and location. In this case, square footage, number of bedrooms, and location would be our independent variables, and house price would be our dependent variable. We could use regression analysis to build a model that quantifies how each of these independent variables influences the house price, allowing us to make predictions about the price of a house based on its features.
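
As a minimal sketch of this idea, a simple one-variable linear regression can be fitted with NumPy; the square-footage and price figures below are made up purely for illustration:

import numpy as np
#hypothetical training data: square footage vs. sale price in dollars
square_footage = np.array([1000, 1500, 2000, 2500, 3000])
price = np.array([200000, 270000, 340000, 410000, 480000])
#fit a straight line: price = slope * square_footage + intercept
slope, intercept = np.polyfit(square_footage, price, 1)
#predict the price of a new 1800-square-foot house
print(slope * 1800 + intercept)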

There are different types of regression techniques, including linear regression, polynomial regression, and logistic regression, each suited for different types of data and relationships between variables. Linear regression, for example, assumes a linear relationship between the independent and dependent variables, while logistic regression is used when the dependent variable is binary (e.g., yes/no, true/false).

Overall, regression analysis is a powerful tool for understanding and predicting continuous outcomes based on other variables, making it widely used in fields like economics, finance, healthcare, and more.

Data Science Terminology - Clustering Problem

Clustering is a type of unsupervised machine learning technique used to group similar data points together based on their features or characteristics. Unlike supervised learning, where the algorithm is trained on labeled data (data with known outcomes), unsupervised learning works with unlabeled data, meaning there are no predefined categories or classes.

The goal of clustering is to discover inherent structures or patterns in the data without prior knowledge of what those patterns might be. The algorithm identifies groups of data points that are more similar to each other than to those in other groups. These groups are called clusters.

Clustering algorithms typically measure the similarity between data points using a distance metric, such as Euclidean distance or cosine similarity. Data points that are close to each other in feature space are considered similar and are likely to be grouped together in the same cluster.
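
For instance, Euclidean distance and cosine similarity between two feature vectors can be computed directly with NumPy (the vectors below are arbitrary):

import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
#Euclidean distance - straight-line distance between the two points
print(np.linalg.norm(a - b))
#cosine similarity - 1.0 means the vectors point in exactly the same direction
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))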

There are various clustering algorithms, each with its own approach to identifying clusters:

One of the most popular clustering algorithms is 'K-means', which partitions the data into a predefined number of clusters (K) based on the mean value of the data points in each cluster. It iteratively assigns data points to the nearest cluster centroid and updates the centroids until convergence (a short usage sketch follows this overview of algorithms).

'Hierarchical clustering': This algorithm builds a hierarchy of clusters by recursively merging or splitting clusters based on their similarity. It can produce dendrogram representations that show the hierarchical relationships between clusters.

'DBSCAN' (Density-Based Spatial Clustering of Applications with Noise): This algorithm groups together points that lie in dense regions (points with many nearby neighbors) and marks points in low-density regions as noise or outliers. Unlike K-means, it does not require the number of clusters to be specified in advance.

'Agglomerative clustering' is a form of hierarchical clustering: it starts with each data point as a separate cluster and iteratively merges the closest pairs of clusters until only one cluster remains or a stopping criterion is met.
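
To make the K-means approach described above concrete, here is a minimal sketch using scikit-learn; the toy points and the choice of K=2 are arbitrary:

import numpy as np
from sklearn.cluster import KMeans
#toy 2-D data: two visually separated groups of points
points = np.array([[1, 1], [1, 2], [2, 1],
                   [8, 8], [8, 9], [9, 8]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)
print(labels)
#e.g. [0 0 0 1 1 1] - each point is assigned to one of the two clusters
print(kmeans.cluster_centers_)
#the coordinates of the two cluster centroids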

Clustering has many applications across various domains, including customer segmentation, anomaly detection, image segmentation, and document clustering. It helps uncover hidden patterns in data and can provide valuable insights for decision-making and analysis.

Hierarchical Algorithms - Agglomerative Clustering

Agglomerative clustering is a bottom-up hierarchical clustering algorithm. It starts with each data point as a separate cluster and then iteratively merges the closest pairs of clusters until only one cluster remains or a stopping criterion is met.

A step-by-step explanation of how agglomerative clustering works:

1. Initialization. 
      Start with each data point as a singleton cluster. Each data point is considered a cluster of its own.
2. Compute pairwise distances.
      Calculate the distance or similarity between all pairs of clusters. The distance between clusters can be computed using various metrics, such as Euclidean distance, Manhattan distance, or cosine similarity.
3. Merge closest clusters.
      Identify the two clusters that are closest to each other based on the distance metric. Merge these two clusters into a single cluster.
4. Update distance matrix.
      Update the distance matrix to reflect the distances between the newly formed cluster and all other clusters. Depending on the linkage criterion chosen (e.g., single linkage, complete linkage, average linkage), the distance between clusters may be computed differently.
5. Repeat.
      Repeat steps 2-4 until only one cluster remains or a stopping criterion is met. This stopping criterion can be based on the number of desired clusters, a specified distance threshold, or other criteria.

The process of merging clusters continues iteratively until the desired number of clusters is obtained or until clusters become too dissimilar to merge further.

One key aspect of agglomerative clustering is the choice of linkage criteria, which determines how the distance between clusters is computed. There are several common linkage criteria:

'Single linkage' - the distance between two clusters is defined as the minimum distance between any two points in the two clusters. It tends to produce elongated clusters.

'Complete linkage' - the distance between two clusters is defined as the maximum distance between any two points in the two clusters. It tends to produce compact, spherical clusters.

'Average linkage' - the distance between two clusters is defined as the average distance between all pairs of points in the two clusters. It provides a balance between single and complete linkage.

Agglomerative clustering is intuitive and easy to understand, making it a popular choice for hierarchical clustering tasks. It can produce dendrogram visualizations that illustrate the hierarchical relationships between clusters, which can be helpful for understanding the structure of the data.
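
As a minimal sketch of how this might look in practice with scikit-learn (the toy data and the 'average' linkage choice are arbitrary):

import numpy as np
from sklearn.cluster import AgglomerativeClustering
#toy 2-D data: two compact groups of points
points = np.array([[0, 0], [0, 1], [1, 0],
                   [5, 5], [5, 6], [6, 5]])
#merge clusters bottom-up using the average-linkage criterion
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(points)
print(labels)
#e.g. [0 0 0 1 1 1] - the two groups are recovered as separate clusters

Dendrogram visualizations of the same data can be produced with SciPy's scipy.cluster.hierarchy module (its linkage and dendrogram functions).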

Agglomerative clustering can be used in various real-world applications across different domains. Some of the use cases include:

Market Segmentation

In marketing, agglomerative clustering can be used to segment customers based on their purchasing behavior or demographic information. By identifying groups of customers with similar characteristics, businesses can tailor their marketing strategies to target each segment more effectively.

Genomic Analysis

In bioinformatics, agglomerative clustering can be applied to analyze gene expression data or DNA sequences. By clustering genes or genomic regions with similar expression patterns or sequences, researchers can uncover insights into genetic regulation, disease mechanisms, or evolutionary relationships.

Image Segmentation

In computer vision, agglomerative clustering can be used for image segmentation tasks. By clustering pixels based on their color or intensity values, images can be partitioned into distinct regions or objects. This is useful for tasks such as object detection, image segmentation, or content-based image retrieval.

Anomaly Detection

Agglomerative clustering can also be used for anomaly detection in various domains, such as network security, fraud detection, or equipment maintenance. By clustering data points and identifying clusters with significantly different characteristics from the rest, anomalies or outliers can be detected.

Document Clustering

In natural language processing (NLP), agglomerative clustering can be used for document clustering or topic modeling. By clustering documents based on their similarity in terms of word usage or semantic content, documents can be organized into thematic groups or topics, enabling tasks such as document classification, summarization, or recommendation.

These are just a few examples of how agglomerative clustering can be applied in practice. Its flexibility and versatility make it a valuable tool for exploratory data analysis, pattern recognition, and knowledge discovery in various fields.

Pros and Cons of Tools & Libraries dedicated to Data Science

Python is a high-level programming language widely used across a variety of industries. Looking at the ranking charts of languages most commonly used among data scientists, it holds a very strong first place, ahead of the others by over 10%, although its usage has declined by 2%. The second in line, Java, is doing better, with a 0.8% increase in usage in the data science industry. About 10% further back is the JavaScript programming language, perhaps owing to its compatibility with online websites and platforms. The languages that follow JavaScript are not far behind: C# trails it by a little over 1%, C/C++ by 1.5%, and PHP comes last, just under 3% short of a place on the podium.

Data processing and modeling.

There are at least nine Python libraries fit to accomplish such tasks.

1. NumPy (for Numerical Python) is a perfect tool for scientific computing and performing basic and advanced operations with arrays.

The library offers many handy features for performing operations on n-arrays and matrices in Python. It makes it possible to deal with arrays that store values of the same data type and facilitates the execution of mathematical operations on the arrays (and their vectorization). In fact, vectorizing mathematical operations on the NumPy array type increases performance and speeds up execution time.
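
A small sketch of what vectorization means in practice - the same element-wise operation written as a Python loop and as a single array expression:

import numpy as np
values = np.arange(1_000_000)
#loop version - interpreted Python, one element at a time
squared_loop = [v ** 2 for v in values]
#vectorized version - a single expression executed in compiled C code
squared_vectorized = values ** 2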

2. SciPy (for Scientific Python). This useful library includes modules for linear algebra, integration, optimization, and statistics. Its core functionality is built on NumPy, so its arrays are NumPy arrays. SciPy works great for all sorts of scientific programming projects (science, math, and engineering). It offers efficient numerical routines, such as numerical optimization and integration, in its sub-modules. The extensive documentation makes working with this library really easy.
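
A tiny sketch of the kind of numerical routine SciPy provides - integrating x squared over the interval [0, 1], whose exact value is 1/3:

from scipy import integrate
#numerical integration; quad returns the result and an error estimate
result, error_estimate = integrate.quad(lambda x: x ** 2, 0, 1)
print(result)
#Output: approximately 0.3333333333333333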

3. Pandas is a library created to help developers work intuitively with 'labeled' and 'relational' data. It is based on two main data structures: 'Series' (one-dimensional, like a Python list) and 'DataFrame' (two-dimensional, like a multi-column table). Pandas allows converting data structures into DataFrame objects, handling missing data, adding/removing columns from a DataFrame, imputing missing values, and plotting data with histograms or box plots. It is an indispensable tool for data manipulation and visualization.
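
A minimal sketch of these ideas (the column names and values are made up):

import pandas as pd
#a one-dimensional Series
s = pd.Series([10, 20, 30])
#a two-dimensional DataFrame built from a dictionary of columns
df = pd.DataFrame({
    "city": ["Rome", "Oslo", "Lima"],
    "population_millions": [2.8, 0.7, None],  #one missing value
})
#impute the missing population with the column mean
df["population_millions"] = df["population_millions"].fillna(
    df["population_millions"].mean()
)
#add and then remove a column
df["hemisphere"] = ["north", "north", "south"]
df = df.drop(columns=["hemisphere"])
print(df)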

4. Keras is an excellent library for building neural networks and modeling. It is very simple to use and offers developers a good degree of extensibility. The library uses other packages (Theano or TensorFlow) as backends. Moreover, Microsoft has integrated CNTK (Microsoft Cognitive Toolkit) to serve as another backend. This is a great choice if you want to experiment quickly using compact systems - the minimalist approach to design is really great!

5. Scikit-Learn. It is an industry standard for Python-based data science projects. SciKits are a group of packages built on SciPy that were created for specific functionality - for example, image processing. Scikit-learn uses SciPy mathematical operations to expose a concise interface to the most common machine learning algorithms.

Data scientists use it to handle standard machine learning and data mining tasks such as clustering, regression, and classification. Another advantage? It comes with great documentation and offers high performance.
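
For instance, a minimal classification sketch; the toy points are arbitrary, and most of the library's estimators follow the same fit/predict pattern:

from sklearn.neighbors import KNeighborsClassifier
#toy training data: points labeled 0 (lower-left group) or 1 (upper-right group)
X_train = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
y_train = [0, 0, 0, 1, 1, 1]
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.predict([[1, 1], [8, 8]]))
#Output: [0 1]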

6. PyTorch is a framework that is perfect for data scientists who want to easily perform Deep Learning tasks. The tool allows performing tensor calculations with GPU acceleration. It is also used for other tasks - for example, to create dynamic computational graphs and automatically calculate gradients. PyTorch is based on Torch, which is an open-source Deep Learning library implemented in C, with a wrapper in Lua.
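
A minimal sketch of the tensor and automatic-gradient features mentioned above:

import torch
#a tensor that records operations so gradients can be computed automatically
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   #y = x1^2 + x2^2
y.backward()         #backpropagate to compute dy/dx
print(x.grad)
#Output: tensor([4., 6.]) - the gradient 2*x
#tensors can also be moved to a GPU for acceleration, e.g. x.to("cuda"),
#provided a CUDA-capable GPU is available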

7. TensorFlow is a Python framework popular in Machine Learning and Deep Learning, developed at Google Brain. It is a great tool for tasks such as object identification, voice recognition, and many more. It lets you work with artificial neural networks that need to handle multiple data sets. The library includes several layer helpers (tflearn, tfslim, skflow) which make it even more functional. TensorFlow is constantly updated with new versions, including fixes for possible security vulnerabilities and improved integration between TensorFlow and the GPU.
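
A minimal sketch of defining and training a tiny neural network with TensorFlow's Keras API; the toy data simply encodes the relation y = 2x + 1:

import numpy as np
import tensorflow as tf
#toy data following y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([[1.0], [3.0], [5.0], [7.0]])
#a single dense (fully connected) layer with one output neuron
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")
model.fit(X, y, epochs=500, verbose=0)
print(model.predict(np.array([[4.0]])))
#Output: approximately [[9.]]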

8. XGBoost - Use this library to implement Machine Learning algorithms in the Gradient Boosting framework. XGBoost is portable, flexible and efficient. It offers parallel tree boosting that helps teams solve many data science problems. Another benefit is that developers can run the same code on leading distributed environments such as Hadoop, SGE, and MPI.
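
A minimal sketch using XGBoost's scikit-learn-style interface; the toy data and parameter values are arbitrary:

import numpy as np
from xgboost import XGBRegressor
#toy regression data: the target is the sum of the two features
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
#gradient-boosted trees with a handful of hypothetical hyperparameters
model = XGBRegressor(n_estimators=50, max_depth=2, learning_rate=0.3)
model.fit(X, y)
print(model.predict(np.array([[3, 5]])))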

9. Theano is a library for defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays. It long served as a backend for Keras, although active development of the original project has since ended.

Data Visualisation

1. Matplotlib is a standard data science library that helps produce data visualizations such as two-dimensional charts and graphs (histograms, scatter plots, non-Cartesian coordinate plots). Matplotlib is one of those plot libraries that are really useful in data science projects - it provides an object-oriented API for integrating plots into applications.

It is thanks to this library that Python can compete with scientific tools like MATLAB or Mathematica. However, developers have to write more code than usual when using this library to build advanced visualizations. It is worth noting that popular plotting libraries work seamlessly with Matplotlib.
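
A minimal plotting sketch using the object-oriented API mentioned above (the data points are arbitrary):

import matplotlib.pyplot as plt
#create a figure and an axes object, then draw on the axes
fig, ax = plt.subplots()
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
ax.plot(x, y, label="line")
ax.scatter(x, y, label="points")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()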

2. Seaborn is based on the previously mentioned Matplotlib and provides a high-level interface for drawing attractive statistical graphics.

NumPy - Efficient and Fast Open Source Python Library for Performing Numerical Operations

#import NumPy library
import numpy as np
#create a NumPy array
regular_array = np.array([1, 2, 3, 4])
#print the created array
print(regular_array)
#array of zeros (shape of it within parentheses)
zeros_array = np.zeros((3, 3))
print(zeros_array)
#Output:
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
#array of ones
ones_array = np.ones((3, 3))
print(ones_array)
#Output:
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
#empty array - contents are uninitialized (whatever happens to be in memory)
empty_array = np.empty((2, 3))
print(empty_array)
#Output (values may vary; they often appear as zeros):
[[0. 0. 0.]
 [0. 0. 0.]]
#arange - evenly spaced values within a given interval
array_arange = np.arange(12)
print(array_arange)
#Output:
[ 0  1  2  3  4  5  6  7  8  9 10 11]
#reshape into 3 rows and 4 columns
print(array_arange.reshape(3, 4))
#Output:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
#linspace - equally spaced data elements
linear_data = np.linspace(11, 20, 5)
#11 - first element
#20 - last element
#5 - number of equidistant elements
print(linear_data)
#Output:
[11. 13.25 15.5 17.75 20. ]
#One dimensional array
one_dimension = np.arange(15)
print(one_dimension)
#Output:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
#Two dimensional array
two_dimensions = one_dimension.reshape(3, 5)
print(two_dimensions)
#Output:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
#Three dimensional array
three_dimensions = np.arange(27).reshape(3, 3, 3)
print(three_dimensions)
#Output:
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]

PyGym Exercise 2.1

Write a program that asks the user for their name and year of birth, then calculates and prints their age.

Example execution:

Give your name:

Johnny

Enter year of birth:

1989

Johnny, you are 30 years old.
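
One possible solution sketch; it hard-codes the current year as 2019 to match the sample output (1989 -> 30), though a real program could take it from the datetime module instead:

#ask for the user's name and year of birth, then print the age
name = input("Give your name:\n")
birth_year = int(input("Enter year of birth:\n"))
current_year = 2019  #assumed from the sample output
age = current_year - birth_year
print(name + ", you are " + str(age) + " years old.")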
