Hello! It’s me again. I’ve discussed parts of what I’m going to mention here in other articles, but now I want to give a few directions on what data science is not and how not to learn it.
So let’s start with the basics.
What is data science?
Data science is not just knowing some programming languages, math, and statistics and having “domain knowledge”.
The time has come. We’ve created a new field, or something like that. There are a lot of things to say and study in this field. The name doesn’t matter; maybe data science is just a temporary name for a bigger field. But the scientific study of data, getting insights from it and then being able to predict something, is the present and future of the world.
I’ll focus on business-related definitions and proposals for data science. Maybe these can apply to the field as a whole, but the ideas in this article are about data science for business.
I’m going to propose three things:
- Data science is a science
- There are awful ways to learn data science
- Using well-made cheatsheets can help you do data science in a systematic way
Data science is a science
I know this may be controversial for some people, but stick with me. What I want to say here is that data science is of course linked to the business, but in the end it is a science, or in the process of becoming one.
I defined data science before as:
[…] The resolution to Business / Organizations problems through mathematics, programming and the scientific method that involves the creation of hypotheses, experiments and tests through the analysis of data and the generation of predictive models. It is responsible for transforming these problems into well-posed questions that can also respond to the initial hypothesis in a creative way. It must also include the effective communication of the results obtained and how the solution adds value to the Business / Organization.
I’m stating here a description and definition of data science as a science. I think it is very useful to describe data science as a science, because if that’s the case, every project in this field should be at least:
- Reproducible: Necessary to make it easy to test other people’s work and analysis.
- Fallible: Data science and science don’t look for the truth, they look for knowledge, so every project can be replaced or improved in the future. No solution is the ultimate solution.
- Collaborative: Data scientists don’t exist alone. They need a team, and this team makes it possible to develop intelligent solutions. Collaboration is a big part of science, and data science should not be an exception.
- Creative: Most of what data scientists do is new research, new approaches or new takes on different solutions, so their environment should be very creative and easy to work in. Creativity is crucial in science; it is the only way we can find solutions to hard and complex problems.
- Compliant with regulations: Right now there are a lot of regulations in science and not that many in data science, but there will be more in the future. It is important that the projects we are building take these different types of regulations into account so we can create clean and acceptable solutions to the problems.
If we don’t follow those basic principles, it will be very hard to conduct a proper data science practice. Data science should be implemented in a way that enables decision making to follow a systematic process. But more on that later.
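As a small illustration of the "reproducible" principle, here is a minimal sketch in Python. It assumes a made-up analysis (the function name `run_experiment`, the simulated measurements and the bootstrap settings are all hypothetical) and shows one common habit: drive every random step from an explicit seed and return the settings together with the results, so someone else can rerun the work and get the same numbers.

```python
import random
import statistics


def run_experiment(seed: int, n_samples: int = 1000) -> dict:
    """Run a toy analysis whose randomness is fully controlled by `seed`."""
    rng = random.Random(seed)  # local generator, no hidden global state

    # Hypothetical data: noisy measurements around a true value of 10.0
    data = [rng.gauss(10.0, 2.0) for _ in range(n_samples)]

    # Bootstrap: resample with replacement to estimate the error of the mean
    boot_means = [
        statistics.fmean(rng.choices(data, k=n_samples)) for _ in range(200)
    ]

    # Return the settings next to the results so the run can be repeated
    return {
        "seed": seed,
        "n_samples": n_samples,
        "mean": statistics.fmean(data),
        "bootstrap_std_error": statistics.stdev(boot_means),
    }


first = run_experiment(seed=42)
second = run_experiment(seed=42)
other = run_experiment(seed=7)

print(first == second)                 # True: same seed, same result
print(first["mean"], other["mean"])    # a different seed gives different draws
```

This only covers the random part of reproducibility. A real project would also need to pin library versions and keep track of the exact data used.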
How NOT to learn data science. The big 3.
If you are here, it’s likely that you are learning data science right now, or you took some MOOCs or even classes in the field. I’m not going to speak badly here about platforms or bad courses; I think we can learn something even in the worst courses.
1. Seeing and seeing without practicing
If you are taking a class on anything related to data science, like math, statistics, or programming, and you are just sitting there listening to the class, well, you are wasting your time. Data science needs practice. Everything you learn, even if the professor doesn’t tell you to, practice it and try it out. This is fundamental to really comprehending things, and when you are working in the field you will be doing a lot of different practical stuff.
A good knowledge of statistics, math and Python won’t make you a successful data scientist. You need more: you need to master your craft and be able to use these tools to solve business problems. So if you are learning something new, and you want to understand it for real, find a scenario where you can apply it or play with it.
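To make "play with it" concrete, here is a minimal sketch, assuming you have just learned in a statistics class that the standard error of the mean shrinks like σ/√n. Instead of only writing the formula down, you can simulate it and compare. The function name `simulated_standard_error` and the choice of a uniform distribution are illustrative assumptions, not part of any course.

```python
import math
import random
import statistics


def simulated_standard_error(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate the standard error of the mean of n uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.uniform(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    # The spread of the sample means is the standard error we want to observe
    return statistics.stdev(means)


# Standard deviation of a uniform(0, 1) variable is sqrt(1/12)
population_sd = math.sqrt(1 / 12)

for n in (4, 16, 64):
    observed = simulated_standard_error(n)
    predicted = population_sd / math.sqrt(n)
    print(f"n={n:3d}  observed={observed:.4f}  predicted={predicted:.4f}")
```

Changing the distribution, the sample sizes or the number of trials and watching what happens is exactly the kind of practice that makes the concept stick.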
2. Creating models in a crazy way
We get data from the “outside world”, our body and brain analyze the raw data we got, and then we “interpret” things.
What is this “interpretation”? Just what we’ve learned about how to react, think, feel and understand from the information we are getting. When we understand something, we are decoding the parts that form this complex thing and transforming the raw data we got in the beginning into something useful and simple.
We do this by modeling. This is the process of understanding “reality”, the world around us, by creating a higher-level prototype that describes the things we are seeing, hearing and feeling. But it’s a representation, not the “actual” or “real” thing.
So think before you do this:
Models are very important in the machine learning and data science space, but they must have a purpose, and you have to understand them before using them. Know what they assume about the data before training them, understand the different metrics they use to learn, how to evaluate them, and more.
For that I can tell you, there’s no harm in reading the documentation of libraries like Scikit-Learn:
- A tutorial on statistical-learning for scientific data processing - scikit-learn 0.20.3… (scikit-learn.org)
- MLlib: Main Guide - Spark 2.4.1 Documentation (spark.apache.org)
- TensorFlow Guide | TensorFlow Core | TensorFlow (www.tensorflow.org)
And more. They’ll lead you to articles, papers and more blog posts and most of them will even have practical examples on how to do modeling in machine learning and statistical learning.
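As one small illustration of modeling with a purpose, here is a minimal sketch in plain Python (deliberately not the API of any of the libraries above) that fits a straight line and then checks it the way any model should be checked: on data it has not seen, with an explicit metric, against a naive baseline. The synthetic data, the 160/40 split and the helper names `fit_line` and `mean_absolute_error` are assumptions made for the example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])


# Synthetic data (an assumption): a linear signal plus Gaussian noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 2.0) for x in xs]

# Hold out the last 40 points; the model never sees them during fitting
split = 160
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

slope, intercept = fit_line(train_x, train_y)
model_predictions = [slope * x + intercept for x in test_x]

# Naive baseline: always predict the average of the training targets
baseline_predictions = [statistics.fmean(train_y)] * len(test_y)

model_mae = mean_absolute_error(test_y, model_predictions)
baseline_mae = mean_absolute_error(test_y, baseline_predictions)

print(f"model MAE:    {model_mae:.2f}")
print(f"baseline MAE: {baseline_mae:.2f}")
```

The line only works well here because the data really is linear with constant noise, which is what the model assumes. If a model cannot clearly beat a baseline like this on held-out data, that is a sign to stop and think before going further.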
Also there are great videos in the field that will take you from zero to hero in minutes like the ones from my friend Brandon Rohrer:
3. “Yeah, I’m a lone wolf. I can study and do everything by myself”
Remember that one of the characteristics I proposed before is that data science is a collaborative field. Well, so is studying it!
I’m not saying here that you need to start a course with your BFF, but make use of what online platforms give us today. We have forums, chats, discussion boards and more where you can meet people learning the same things you are learning. It will be much easier to learn with more people, and don’t be afraid of asking questions.
Ask as many questions as you need to understand something, and don’t rest until you do. Don’t harass people either, but if you ask politely most people will be more than happy to help you.
Here are some great resources (apart from the ones the MOOCs and courses offer internally) to find people learning the same things as you:
- Stack Overflow - Where Developers Learn, Share, & Build Careers (stackoverflow.com)
- Quora: a place to gain and share knowledge (www.quora.com)
- Deep Cognition Community (community.deepcognition.ai)
- r/datascience: a place for data science practitioners and professionals to discuss and debate data science career… (www.reddit.com)
Systematic data science with cheatsheets
Cheatsheets save you time by providing bits of knowledge about different pieces of a language, concept or library. Some cheatsheets also contain hyperlinks to the documentation and to package-level cheatsheets for the most important packages in R, Python, Scala and more.
At the end of last year, I created a repository that went viral, collecting all the different cheatsheets you can use to do data science.
List of Data Science Cheatsheets to rule the world - FavioVazquez/ds-cheatsheets (github.com)
In the repo you’ll find cheatsheets about:
- Business Science
- Math and Calculus
- Big Data
- Machine Learning
- Deep Learning
- Data Visualization
- Data Science in General and Others
There you’ll find a PDF and a PNG version of every cheatsheet. Feel free to download the repo as a zip to get all the information, and if you find or create a new one that you think is useful, please create a pull request with it.
Thanks for reading, and hopefully this can help you find the path to be a successful professional in the data world. More in the future :)
Bio: Favio Vazquez is a physicist and computer engineer working on Data Science and Computational Cosmology. He has a passion for science, philosophy, programming, and music. He is the creator of Ciencia y Datos, a Data Science publication in Spanish. He loves new challenges, working with a good team and having interesting problems to solve. He is part of the Apache Spark collaboration, helping in MLlib, Core and the Documentation. He loves applying his knowledge and expertise in science, data analysis, visualization, and automatic learning to help the world become a better place.
Original. Reposted with permission.