Hint: it wasn’t for the pretty plots
Note: This article was originally published in Towards Data Science on December 26, 2020.
At the time of this writing, I am a fourth-year Ph.D. student at the University of Michigan, where I use machine learning and statistical modeling on biological datasets to study cancer metabolism and infer new potential cancer drug treatments.
Coming into graduate school, I had almost zero experience working in data science. By working exclusively on research projects, I managed to scrape by and developed models and software for studying cancer. I hope to share some of the things I have learned and mistakes I’ve made along the way as I finish my training.
“Where should I start?” is a question I get asked a lot by people who want to transition into a data science career. Let’s walk through and evaluate some popular starting points.
Some popular languages to learn for data science (and why I’m not recommending them)
There’s an ongoing debate on the best language to learn for getting started in data science.
Python is the one I hear the most, because of its general-purpose design, human-readable syntax, and established data science community.
However, Python, while not that difficult, still has a non-negligible learning curve. Additionally, the computer science and statistics topics you need to tackle on top of it add more friction to your learning journey. For these reasons, I wouldn’t start with Python as a fledgling data scientist.
R is another appealing language to learn, especially for its state-of-the-art mathematical and statistical modeling packages. But I would not recommend that a beginner start with R either, for the same reasons as Python.
Julia is a relatively new language that combines the ease of use of Python with the computational speed of C++. Because it is so new, learning Julia now is also a great opportunity to become an early expert and build cool new things. However, that newness also means there are fewer resources, less support, and fewer pre-built packages for data science.
These arguments raise the question: what language should I start with to learn data science?
Where did I start?
In my graduate research, I got started learning Python and MATLAB. This was mainly because I needed to use software built on both languages. Looking back, I found that the language that accelerated my data science learning the most was MATLAB.
While I wouldn’t consider MATLAB to be a general-purpose programming language, I argue it should be a starting point to learn data science because:
- Its syntax reduces the friction of programming math problems.
- It has interactive plotting features that facilitate data analysis.
- It has built-in machine learning and deep learning toolboxes for learning high-level data science concepts.
MATLAB is built for linear algebra
As a data scientist, you will be using linear algebra to analyze large datasets with many attributes and samples. Other programming languages can perform linear algebra operations as well as, if not better than, MATLAB. However, MATLAB’s syntax is much closer to standard linear algebra notation, which can drive home the core concepts better.
Let’s take the example of a simple operation: the dot product between two matrices A and B. We will compare the syntax in Python, R, and MATLAB.
First, here’s the Python implementation:
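A minimal sketch of what this might look like with NumPy. The matrix values are made up for illustration, and note that for two 2-D arrays `np.dot` performs matrix multiplication:

```python
import numpy as np

# Illustrative 2x2 matrices; the values are assumptions chosen for this sketch.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# For 2-D arrays, np.dot performs matrix multiplication (the same as A @ B).
C = np.dot(A, B)
print(C)
```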
Seems simple, but we still have to use NumPy data objects and methods for the dot product.
Let’s now consider the R implementation:
Again, we need to load up the geometry library and deal with the less-intuitive R syntax, which can be confusing to a beginner.
Let’s now consider MATLAB’s implementation of the dot product:
MATLAB automatically knows how to construct a matrix, and visually, this looks more like what you would see in a linear algebra textbook.
Additionally, many linear algebra operations are built into MATLAB’s basic functionality because it is a language that was built for mathematical modeling and engineering.
MATLAB’s ease-of-use and similarity to standard linear algebra notation can be helpful to learn the fundamental mathematical concepts you will encounter in your data science career.
Interactive plots are built-in
While it’s possible to run MATLAB using the command line, it comes with an Integrated Development Environment (IDE) for writing software, testing code, and debugging.
In recent versions of MATLAB, interactive plots can be generated in a script or in a notebook, which can facilitate on-the-fly data analysis. We’ll take the simple example of generating a scatter plot from two vectors.
And below is the resulting scatterplot.
In MATLAB, I clicked on the two center data points to show their x- and y-coordinates. If we wanted to add more complexity, such as additional attributes associated with each data point (for example, Person A and Person B), this text would show up in those data points as well.
Interactive plots are possible in Python and R using libraries such as Plotly and Dash, and the plots you can make are beautiful. However, there is some learning you would need to do to format the data in a compatible way and design these plots.
MATLAB’s plots are not the prettiest, but the interactivity and simplicity in coding up these plots can facilitate complex data analysis, especially for a beginner.
Statistics, machine learning, and deep learning toolboxes are available to train models efficiently and quickly
MATLAB has additional software that enables model training, such as the Statistics and Machine Learning Toolbox. This extension allows the user to perform standard statistical procedures and train simple machine learning models.
To demonstrate how simple it is to prototype a machine learning model in MATLAB, let’s train a random forest model on the Cars dataset to predict fuel economy.
With fewer than four lines of code, we can train a machine learning model. Additional functions are built in to help visualize and evaluate the model results.
Deep Learning applications are being built to solve real-world problems, and now is a great time to learn about these complex algorithms. MATLAB has developed the Deep Learning Toolbox, allowing us to train deep neural networks on image, text, and numerical data.
Why MATLAB won’t be dominating data science
While MATLAB can be useful for learning data science, it does have the following limitations that prevent it from becoming the primary language for data science:
- It’s really hard to create web applications in MATLAB.
- It is not open-source (and usually requires an institution to pay for the license).
- It lags in computing performance and does not stay current with state-of-the-art deep learning algorithms, compared to other languages.
These days, I still use MATLAB, but only for quick prototyping and running applications specific to my graduate research. Once I mastered the basics, I found myself learning Python and R for more general-purpose programming: developing applications in Python and performing more complex analyses in R.
However, learning data science in MATLAB served as an excellent foundation for my research, and hopefully can be the tool that gets you started in your data science journey.
If you want to share your perspectives about how you got started in data science or helpful tools to start learning, leave a response in the comments below. Additionally, feel free to reach out via LinkedIn — I’d love to connect with you and discuss all things nerdy!