Statistics and Data Science advice needed

Angry Beaver · 03-07-2016, 01:10 PM

I have a humanities background, which has unfortunately left me with limited knowledge of and skill in quantitative research. I'm trying to enhance my research skills and I'm looking into statistics - been reading some course materials on my own and having a go on Khan Academy - as in the next 10 years probably no piece of research will be accepted without strong numerical proof.

My aim is to make myself marketable to consultancy firms that look into political/security risk, where numerical proficiency is a must to make strong arguments. However, I'm fairly certain that there are nice open-source programs out there which can help me get halfway.

Which are some of the easiest and most user friendly (big) data analysis tools out there that I can start trying out in order to feel more comfortable? I'll happily take all book recommendations, web sources etc. in this field too.

WeekendCasanova · 03-07-2016, 02:10 PM

I didn't get a statistics degree, but I took a lot of statistics courses for my Econ/Finance degree. I'm not sure if it's the same now, but when I was in university, all of the professors made us use tools from IBM or Oracle, both of which are apparently used industry-wide.

I wouldn't call them 'user-friendly', but they offer a lot of programs designed for what you're looking for, and they don't take a long time to learn. SPSS and R are the ones I'm most familiar with personally, and they're pretty powerful.

Otherwise, there's always trusty Excel.

~~NASA Test Pilot~~ · 03-07-2016, 04:11 PM

SPSS is generally a good package that is aimed more at the social sciences and has been around conceptually for a long time. You might also examine software for Statistical Process Control (SPC), where SAS is very strong. For more Operations Research, look at Analytica, Risk (modeling) and Crystal Ball (forecasting and optimization). If you want to go deep, look into Mathematica (including its Operations Research specializations).

Excel has always been strong and you can even do a lot of programming with Excel. Try to find some original user guides from Excel 4.0 or earlier (old stuff) and you will see some of the programming depth available a few decades back (you could even do simulations). Most people now just expect that to be in a GUI, but a GUI lacks customization. I have used or had my teams use all of these in a previous life related to Operations Research and war planning. In general, understand up front what types of things you want to do: forecasting, simulation (including routing), Linear Programming (LP), Non-Linear Programming (NLP), basic statistics, decision analysis, optimization, supply chain management, routing, etc. Then go find the software.

I provide the above because you mentioned political/security risk and I have some experience in this area.

Ensam · 03-07-2016, 07:20 PM

Python has a number of good tools for dealing with large data sets and doing analysis (check out pandas). There are a few features of Python (e.g. generators) that make it particularly well suited for handling very large amounts of data. It's also worth familiarizing yourself with map-reduce techniques. I find they're useful for organizing my analysis even on small data sets.

In terms of stats knowledge, get familiar with both frequentist and Bayesian techniques. Most casual users (and consumers) of data are familiar with frequentist ideas, but the hardcore machine learning guys are all doing Bayesian stuff these days.
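To make the generator and map-reduce remark concrete, here is a minimal sketch using only the standard library. The data, the column layout, and the function names (`read_rows`, `map_row`, `reduce_counts`) are made up for illustration; a real job would read from a file or a database rather than an in-memory string.

```python
from collections import Counter
from functools import reduce
import io


def read_rows(handle):
    """Generator: yield one parsed row at a time instead of loading everything."""
    for line in handle:
        line = line.strip()
        if line:
            country, events = line.split(",")
            yield country, int(events)


def map_row(row):
    """Map step: turn a single row into a small partial count."""
    country, events = row
    return Counter({country: events})


def reduce_counts(left, right):
    """Reduce step: merge a partial count into the running total."""
    left.update(right)  # Counter.update adds counts together
    return left


# Made-up sample data standing in for a large file.
sample = io.StringIO("FR,3\nDE,1\nFR,2\nUS,5\n")

# map() is lazy too, so rows flow through one at a time.
totals = reduce(reduce_counts, map(map_row, read_rows(sample)), Counter())
print(dict(totals))
```

Because both the generator and `map` are lazy, only one row is held in memory at a time, which is the property that matters once the input no longer fits in RAM.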
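As a rough illustration of the difference between the two schools, the sketch below estimates a proportion both ways. The numbers (7 successes in 10 trials), the uniform Beta(1, 1) prior, and the function names are assumptions chosen for the example, and the frequentist interval uses the simple normal approximation rather than anything more refined.

```python
import math


def frequentist_estimate(successes, trials):
    """Point estimate plus an approximate 95% confidence interval."""
    p_hat = successes / trials
    std_err = math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, (p_hat - 1.96 * std_err, p_hat + 1.96 * std_err)


def bayesian_posterior_mean(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean under a Beta(prior_a, prior_b) prior (conjugate update)."""
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    return post_a / (post_a + post_b)


p_hat, interval = frequentist_estimate(7, 10)
post_mean = bayesian_posterior_mean(7, 10)
print(p_hat, interval, post_mean)
```

With little data the prior pulls the Bayesian estimate toward 0.5; as the number of trials grows, the two estimates converge.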

As others have mentioned, R and SAS are standard tools. JMP is another common package, especially in engineering environments.

polar · 03-08-2016, 09:09 AM

Don't go into big data until you can work with small and medium data. The conceptual knowledge remains the same, whether it's 100 data points or 1,000,000. The difference is that the size will determine the software you have to use.

Unless you're a specialist within consulting (on a data science team - most of these guys have master's degrees or above in math-related fields), you're not going to look at BIG data sets by yourself. I wouldn't expect political or security consultancies like Eurasia Group to use very large data. If you're looking at entry level, get very familiar with Excel and just familiar enough with R, SAS, SPSS or Python to be able to do some simple tasks and have it on your resume.

Unless you are using more than 100,000 rows of data or need accuracy to the decimals past 1,000,000,000 (such as the hard sciences), Excel is the best place to start. There's a ton that you can do in Excel.

Statistics and t-tests? Excel.
Pivot tables to break down and analyze complex data sets? Excel.
Scenario and sensitivity analysis? Excel.
Solver for best-fit formula? Excel.
Trace formulas through your entire model to understand assumptions? Excel.
Charts and graphs? Excel.
Automation of repetitive tasks you perform with your data? Excel (VBA).
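For anyone who later moves from Excel to code, the pivot-table idea in the list above carries over directly. Below is a minimal sketch in plain Python; the rows (region, year, incident count) are invented for illustration and `pivot_sum` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented example rows: (region, year, incidents)
rows = [
    ("Europe", 2014, 3),
    ("Europe", 2015, 5),
    ("Asia", 2014, 2),
    ("Asia", 2015, 4),
    ("Europe", 2015, 1),
]


def pivot_sum(records):
    """Sum incidents by region (rows of the pivot) and year (columns)."""
    table = defaultdict(lambda: defaultdict(int))
    for region, year, incidents in records:
        table[region][year] += incidents
    # Convert to plain dicts so the result prints cleanly.
    return {region: dict(years) for region, years in table.items()}


pivot = pivot_sum(rows)
print(pivot)
```

This is the same operation as dragging "region" to rows, "year" to columns and "incidents" to values in an Excel pivot table.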

What you can't do well in Excel is one-to-many lookups (for example, types of plane tickets - economy, business, first class) and BIG data sets (it starts having issues over 100k rows). For that, look at MS Access and SQL. That gets you up to 3 million rows or so, and you can link it to an Excel pivot table for others to use.
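A one-to-many lookup like the plane ticket example is a plain join in SQL. The sketch below uses Python's built-in `sqlite3` module instead of MS Access so it can be run anywhere; the table names, routes and prices are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flights (flight_id INTEGER PRIMARY KEY, route TEXT);
CREATE TABLE fares (
    flight_id INTEGER REFERENCES flights(flight_id),
    fare_class TEXT,
    price REAL
);
INSERT INTO flights VALUES (1, 'LHR-JFK'), (2, 'CDG-NRT');
INSERT INTO fares VALUES
    (1, 'economy', 450), (1, 'business', 2100), (1, 'first', 5200),
    (2, 'economy', 700), (2, 'business', 3100);
""")

# One flight matches many fare rows; the join returns all of them.
fare_rows = conn.execute("""
    SELECT f.route, fa.fare_class, fa.price
    FROM flights AS f
    JOIN fares AS fa ON fa.flight_id = f.flight_id
    ORDER BY f.flight_id, fa.price
""").fetchall()

for route, fare_class, price in fare_rows:
    print(route, fare_class, price)
```

A lookup formula in Excel returns only the first match per flight, whereas the join returns every fare class, which is why this kind of question is easier to answer in a database.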

Look around here for other data analyst and data science threads.

Angry Beaver · 03-08-2016, 12:52 PM

Tons of leads here, thank you everyone!

ksbms · 03-09-2016, 06:48 AM

Quote: (03-07-2016 02:10 PM)WeekendCasanova Wrote:

I didn't get a statistics degree, but I took a lot of statistics courses for my Econ/Finance degree. I'm not sure if it's the same now, but when I was in university, all of the professors made us use tools from IBM or Oracle, both of which are apparently used industry-wide.

I wouldn't call them 'user-friendly', but they offer a lot of programs designed for what you're looking for, and they don't take a long time to learn. SPSS and R are the ones I'm most familiar with personally, and they're pretty powerful.

Otherwise, there's always trusty Excel.

R trumps SPSS any time of day or night.

future · 03-09-2016, 08:00 AM

Quote: (03-07-2016 01:10 PM)Angry Beaver Wrote:

I'll happily take all book recommendations, web sources etc. in this field too.

Data Science consists of three things: math, statistics, and programming. You said you have a humanities background. So I don't know how much of each you know already. Assuming you have a minimal background in each, here are some resources to get you started:

Khan Academy

I'd recommend going through the sections on probabilities because that will be most useful. Also, at some point you may want to learn linear algebra if you want to get deeper into data science.

dataquest.io

They've got great hands-on tutorials on using Python for data analysis. Although some of the advanced tutorials require a paid subscription, the free ones should be enough to get you started. Once you have completed those you should be able to set up a Python environment on your computer and start doing some data analysis of your own.

CodeCademy

Dataquest will be useful assuming you have at least some experience with programming. If you have none, starting with something more basic such as CodeCademy will be better.

Think Stats

That's an excellent book to understand the basics of statistics. The author has a very pragmatic style of explaining things. Instead of getting into a lot of theory, he writes quick Python code to get you through stuff. So this book is good for learning statistics as well as sharpening your Python skills. And it's all available for free.

Kaggle

That's an online community of statistics/machine learning experts and wannabe experts. It's worth hanging out in their forums just to learn what's hot in the field and what are some good resources to learn from. They also maintain a wiki with links to lots of very cool tutorials online. At some point you may also want to participate in one of their challenges. But that will probably come later, once you understand some decent amount of machine learning.

Good luck, and let me know if there's something more specific that you're looking for.

cibo · 03-09-2016, 08:06 AM

Excel is a good starting point since everything you do conceptually will be the same for the most part when you start moving to more specialized tasks.

Unless you're doing something more specialized, such as fitting statistical/machine learning models or pulling data from databases, 80%-90% of what you need to do can be done in Excel. I doubt you're working with data over 100k records, which is what many consider the break point where you start using other tools.

Once you get familiar with data work, picking up a statistical program is a good idea. R, SAS, and Python (scipy + pandas + scikit-learn) are good. I'd recommend Python more, since if you work with web data you're pulling from APIs, which Python is better at compared to R/SAS. This is mainly because Python is a programming language first and a stats tool second, which is the exact reason some people prefer R over Python. I never really liked R's syntax or some of its design choices, but it's a good program. SAS is decent too, but most people don't have access to it unless they work for a big company, since it runs about $2K a license. I practiced a bit on a gimped student copy in grad school that couldn't even run a stats model, until I worked with a real copy.

It would be a good idea to learn SQL at some point if you want to work with databases, which contain what many consider to be "Big Data". Most databases are relational and use SQL. For the ones that aren't, most have SQL packages to make them more accessible. Also, many web APIs are structured to use a SQL-like language for getting data from them, Facebook being an example.

Also SPSS is shit. It is Excel with fewer features overall and a clunkier version of what better stats programs offer.

Silver_Tube · 03-09-2016, 08:23 AM

I'm going down this route myself, from a programming background. I favor books and courses when I'm trying to learn something. If I had to start at zero I'd go through this sequence:

Learn Python the Hard Way (web tutorial)
Learn SQL the Hard Way (web tutorial)
MITx 6.00.1x and 6.00.2x (edx)
Excel to MySQL: Analytic Techniques for Business (coursera)
Python for Data Analysis (book by the author of the pandas library)

Positions for this kind of work are friggin everywhere. You can be a business analyst or a report writer, and if you fail at it you could go into IT and work your way into database admin as a plan B. I'm doing SQL crap for a bank now, trying to make the jump to proper analyst. If I stay in the area, that will likely mean doing this for a trucking company or something.

InsertNameHere · 03-09-2016, 09:30 AM

Quote: (03-07-2016 01:10 PM)Angry Beaver Wrote:

I have a humanities background, which has unfortunately left me with limited knowledge of and skill in quantitative research. I'm trying to enhance my research skills and I'm looking into statistics - been reading some course materials on my own and having a go on Khan Academy - as in the next 10 years probably no piece of research will be accepted without strong numerical proof.

My aim is to make myself marketable to consultancy firms that look into political/security risk, where numerical proficiency is a must to make strong arguments. However, I'm fairly certain that there are nice open-source programs out there which can help me get halfway.

Which are some of the easiest and most user friendly (big) data analysis tools out there that I can start trying out in order to feel more comfortable? I'll happily take all book recommendations, web sources etc. in this field too.

Wow, it's uncanny how closely your background and aspirations match mine, and I was just thinking about the same issues as well.

I'm wondering how much effectiveness/legitimacy one can get from self-teaching in the way you described. I have the chance to do a 1-year MSc in Data Science from a reputable school for free. However, the program is only 1 semester of full-time class with an apprenticeship for the 2nd semester, and that's a full year of my life that I could dedicate towards teaching myself or improving from on-the-job training.

Those of you who know the field: is formal training necessary/worth it, or can one realistically get a decent foundation through self-study that will also be seen as legitimate enough to get jobs?

Peregrine · 03-09-2016, 11:40 AM

Quote: (03-09-2016 09:30 AM)InsertNameHere Wrote:

Wow, it's uncanny how closely your background and aspirations match mine, and I was just thinking about the same issues as well.

I'm wondering how much effectiveness/legitimacy one can get from self-teaching in the way you described. I have the chance to do a 1-year MSc in Data Science from a reputable school for free. However, the program is only 1 semester of full-time class with an apprenticeship for the 2nd semester, and that's a full year of my life that I could dedicate towards teaching myself or improving from on-the-job training.

Those of you who know the field: is formal training necessary/worth it, or can one realistically get a decent foundation through self-study that will also be seen as legitimate enough to get jobs?

Formal training is usually required.

You should talk to people in the field you want to enter and ask them about the specific MSc you're considering. They'll know whether it's worth it.

cibo · 03-10-2016, 04:36 AM

Most data science positions require at least a master's. I've hired data scientists, and the work lends itself to a lot of academic thinking, which needs some formal training. After the master's, I expect self-study, since there's always some new tech coming down the pipeline.

Post or PM the link for the MSc and I can tell you if the program is decent.

Wreckingball · 03-10-2016, 09:42 AM

Quote: (03-07-2016 01:10 PM)Angry Beaver Wrote:

I have a humanities background, which has unfortunately left me with limited knowledge of and skill in quantitative research. I'm trying to enhance my research skills and I'm looking into statistics - been reading some course materials on my own and having a go on Khan Academy - as in the next 10 years probably no piece of research will be accepted without strong numerical proof.

My aim is to make myself marketable to consultancy firms that look into political/security risk, where numerical proficiency is a must to make strong arguments. However, I'm fairly certain that there are nice open-source programs out there which can help me get halfway.

Which are some of the easiest and most user friendly (big) data analysis tools out there that I can start trying out in order to feel more comfortable? I'll happily take all book recommendations, web sources etc. in this field too.

Excel. It's free (if you know what I mean), it's easy and there's a shitload of info on the internet. It does everything you will ever need for statistics.

Coursera and Khan Academy should also have decent courses on statistics.

GillesDeleuze · 03-10-2016, 09:54 AM

Is it easier to go location independent as a front end developer or data scientist?

From surfing sites like upwork and the usual suspects it seems that there are more opportunities as a developer rather than data scientist

At the moment I'm going through calculus and discrete math as prerequisites to understand the logic behind programming, but I would like to know if data science could be a possible alternative. If it is, I'd switch to subjects like statistics and the follow-ups to calculus (multivariable calculus and linear algebra), and then the basics of Python for data analysis plus R.

InsertNameHere · 03-10-2016, 12:42 PM

Quote: (03-10-2016 04:36 AM)cibo Wrote:

Most data science positions require at least a master's. I've hired data scientists, and the work lends itself to a lot of academic thinking, which needs some formal training. After the master's, I expect self-study, since there's always some new tech coming down the pipeline.

Post or PM the link for the MSc and I can tell you if the program is decent.

Thanks for the help. These are the two programs I'm looking at:

http://www.ensae.fr/formations-navhorizo...rs-3a.html

http://datascience-x-master-paris-saclay...ignements/

I couldn't find a description for either program written entirely in English, but most of the class titles themselves are either in English or their meaning is pretty apparent.

Both the programs involve schools that are well-respected in France, although the second is a bit more prestigious (one of the partners is the French equivalent to MIT). As far as I can tell, the former is a little more flexible for personalisation, while the latter is a bit more technical.

Thoughts?

cibo · 03-12-2016, 02:56 PM

Quote: (03-10-2016 12:42 PM)InsertNameHere Wrote:

Thanks for the help. These are the two programs I'm looking at:

http://www.ensae.fr/formations-navhorizo...rs-3a.html

http://datascience-x-master-paris-saclay...ignements/

I couldn't find a description for either program written entirely in English, but most of the class titles themselves are either in English or their meaning is pretty apparent.

Both the programs involve schools that are well-respected in France, although the second is a bit more prestigious (one of the partners is the French equivalent to MIT). As far as I can tell, the former is a little more flexible for personalisation, while the latter is a bit more technical.

Thoughts?

So data science pretty much has 2 major branches trying to own the term.

1) The statistical branch that is into the theory of probability and how to utilize it to analyze data. It mainly uses different shades of regression, curve fitting, forecast methods, and experiment design. They created most of the modern statistical methods, from p-values to Bayes, cluster analysis, etc. They use statistical programs/programming languages, but they tend to work with cleaner data that has a more consistent structure, and they are in general more conservative in their approaches.

2) The other branch comes from computer science. They are more into how to work with the data and how to analyze data at scale. Most of the newer methods in data science have been coming out of this area: text analytics, decision trees, deep learning neural networks, etc. People coming from this branch use more traditional programming languages and may or may not use a statistical programming language to solve their data problems. They work with poorer data sources that may not be cleaned and can be unstructured (free text, images, sound). They tend to be less concerned with theoretical correctness and more into computer processing time and algorithm design. In general, they're a bit more willing to try new approaches.

The first branch is better positioned for the stats/theory-heavy parts of data science: forecasting, risk modeling, and research studies. This usage of data science is fairly intertwined with economics at this point, and I would say this program lends itself to academia, financial institutions, think tanks and government policy work.

The second branch leans more towards the data processing side of data science. It lends itself to the tech industry and some hedge funds. Google epitomizes that branch. Most of their machine learning models are theoretically simple, but how they apply those models at scale is very impressive. When people are talking about “Big Data” it is usually this side of data science.
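As a small illustration of the regression and curve fitting mentioned under the first branch, here is an ordinary least-squares line fit written out by hand. The data points are made up (they lie exactly on y = 2x + 1), and `fit_line` is a hypothetical helper; in practice you would use R or a Python statistics library rather than rolling your own.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up points on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The statistical branch spends most of its effort on what surrounds a fit like this (assumptions, uncertainty, experiment design), while the computer science branch cares more about running something similar over billions of rows.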

The masters you posted fit nicely into those two paradigms I see come up all the time. And I think both will give a decent foundation to begin your career in data science for the most part.

I did notice neither program mentions anything about databases. This has been a traditional gap I’ve seen in most data science master's programs. In industry, most of the data you will analyze will be in databases. When you start working with millions and billions of records, you will not be able to crunch the data on your laptop, and in most cases you will need to use a database querying engine of some sort. When you get to what data scientists (not business people) consider “Big Data”, you’re working with Hadoop and other distributed databases to process your data.

Most data scientists, unless they’re coming from a strong computer science background, are quite weak on database fundamentals. I don’t think anyone expects you to design all aspects of a database schema, but understanding joins in SQL in a basic relational database and the trade-offs between normalized and denormalized databases would go a long way. Most people will figure this stuff out pretty quickly since it’s not terribly hard, but it’s another thing you need to learn when you might already be struggling to adjust from academia to the real world.
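To show what the normalized versus denormalized trade-off looks like in practice, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (customers, orders, and a flattened `orders_flat` table) and all values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: each customer's country is stored exactly once.
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'FR'), (2, 'DE');
INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 15.0);

-- Denormalized: country is copied onto every order row.
CREATE TABLE orders_flat AS
    SELECT o.order_id, o.amount, c.country
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Normalized schema: the analysis query needs a join.
normalized_totals = conn.execute("""
    SELECT c.country, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

# Denormalized table: no join, but the country is duplicated per order.
flat_totals = conn.execute("""
    SELECT country, SUM(amount)
    FROM orders_flat
    GROUP BY country
    ORDER BY country
""").fetchall()

print(normalized_totals, flat_totals)
```

Both queries give the same totals. The normalized form avoids duplicated data and is easier to keep consistent when a customer's details change; the denormalized form trades that for simpler and often faster analytical queries.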

~~jj90~~ · 03-13-2016, 03:16 PM

Quote: (03-10-2016 09:54 AM)GillesDeleuze Wrote:

Is it easier to go location independent as a front end developer or data scientist?

From surfing sites like upwork and the usual suspects it seems that there are more opportunities as a developer rather than data scientist

I can only share the front end side but generally yes if your company is flexible and the structure you have with them is conducive to it. At my work we remote into the server and thus we can work from anywhere in the world. Many on the dev team here start out as contractors, which means you work and bill as you please. For my limited time at this place so far, a few months, I've been in the office for maybe a grand total of 10 days. Depends on seniority too, I'm pretty junior and get grunt work, but grunt work only needs to get done, I don't actually need to be there. The senior devs need to be a lot more accessible to management for meetings.

Peregrine · 03-13-2016, 06:33 PM

Great summary, cibo. Thanks for that.

Quote: (03-13-2016 03:16 PM)jj90 Wrote:

I can only share the front end side but generally yes if your company is flexible and the structure you have with them is conducive to it. At my work we remote into the server and thus we can work from anywhere in the world. Many on the dev team here start out as contractors, which means you work and bill as you please. For my limited time at this place so far, a few months, I've been in the office for maybe a grand total of 10 days. Depends on seniority too, I'm pretty junior and get grunt work, but grunt work only needs to get done, I don't actually need to be there. The senior devs need to be a lot more accessible to management for meetings.

From my experience, dev work is more easily done remotely for the simple reason that employers like to have their "data science" guys physically available to explain stuff to senior people.

Which leads me to another point: it's not enough to become proficient with the tools of data science. As the saying goes, there are lies, damn lies, and then there's statistics. You can make the data say whatever you damn please. The important value-add is having accurate mental models of reality in your head (i.e. "common sense") so that you can gut-check your work. Having the right assumptions is just as important as performing the analysis correctly.

Unless, of course, you know management wants you to come up with a certain finding from the data. Then you'll have to decide whether you're willing to go along with it.

cibo · 03-14-2016, 12:53 AM

I do think it's very possible to work remotely. All you have to do is remote in and do your work on their server.

One issue that makes it hard to become location independent is that you have to explain your analysis. It is possible to explain how your models and analysis work by phone, but if you work full time for a company they tend to be a bit more bitchy about you being in the office. This tendency increases with seniority. I was able to duck out of the office when I had less responsibility, but it's getting harder lately.

Separating out my independent consulting work, the best I had it as a full-time employee was one week on and one week from home, when I was working at a consulting firm and the client was pretty chill about me traveling to them since they had to pay for my travel expenses.

Once you do your own thing, you can visit maybe once a week or once every other week if your clients require face time. If they aren't super bitchy about face time, you can probably go a few months without seeing them and communicate by phone.

Tayo · 11-11-2018, 09:20 AM

Which one did you go for - data science or front end development?

Quote: (03-10-2016 09:54 AM)GillesDeleuze Wrote:

Is it easier to go location independent as a front end developer or data scientist?

From surfing sites like upwork and the usual suspects it seems that there are more opportunities as a developer rather than data scientist

At the moment I'm going through calculus and discrete math as prerequisites to understand the logic behind programming, but I would like to know if data science could be a possible alternative. If it is, I'd switch to subjects like statistics and the follow-ups to calculus (multivariable calculus and linear algebra), and then the basics of Python for data analysis plus R.
