data-science on francisco yirá's blog

Building an Airflow Pipeline That Talks to AWS — Data Pipelines in the Cloud (III)

Fri, 14 Jun 2024 00:00:00 +0000

This tutorial is a complete guide to building an end-to-end data pipeline with Apache Airflow that communicates with AWS services like RDS (relational database) and S3 (object storage) to perform data transformations automatically and efficiently.

Using Amazon Web Services (AWS) with the Command Line — Data Pipelines in the Cloud (II)

Sun, 12 May 2024 00:00:00 +0000

Welcome back to the 'Data Pipelines in the Cloud' series! In the first part, I introduced Airflow as a tool for orchestrating data pipelines and demonstrated how to code and execute a minimal Airflow pipeline (DAG) in your local environment. In this second part, we'll lay the groundwork for a more functional Airflow DAG by using the AWS Command Line Interface to set up a relational database in the cloud (PostgreSQL), along with a bucket for object storage (S3). We'll then upload a sample CSV file to the bucket, which we'll later use as input for an Airflow DAG that performs a meaningful transformation on this data.

Sharing My Advent of Code 2023 with Quarto (And How You Can Do the Same)

Thu, 28 Dec 2023 00:00:00 +0000

As a Christmas enthusiast, I've always been intrigued by the Advent of Code, a series of daily programming puzzles leading up to Christmas. This year, I'm taking on the challenge with either R or Python, adding a touch of whimsy by using a spinning wheel to choose my language each day. I'm also sharing my solutions on a special Advent of Code-themed website. Find out how you can create your own Advent of Code site and automate the process with the `aochelpers` R package.

A Beginner's Introduction to Airflow with Docker — Data Pipelines in the Cloud (I)

Sun, 13 Aug 2023 00:00:00 +0000

Learn the essentials of Apache Airflow for creating scalable and automated data pipelines in the cloud with this comprehensive, step-by-step beginner’s guide. Discover what problem Airflow solves and under what circumstances it is better to use it, then run your first Airflow DAG on Docker with the Windows Subsystem for Linux.

Matching in R (III): Propensity Scores, Weighting (IPTW) and the Double Robust Estimator

Sun, 01 May 2022 00:00:00 +0000

In the last part of this series about Matching estimators in R, we'll look at Propensity Scores as a way to solve covariate imbalance while handling the curse of dimensionality, and at how to implement a Propensity Score estimator using the `twang` package in R. We'll also explore the importance of common support, the inverse probability weighting estimator (IPTW) and the double robust estimator, which combines a regression specification with a matching-based model in order to obtain a good estimate even when there is something wrong with one of the two underlying models.

Matching in R (II): Differences between Matching and Regression

Thu, 31 Mar 2022 00:00:00 +0000

Welcome to the second part of the series about Matching estimators in R. This sequel builds on the first part and the concepts explained there, so if you haven’t read it yet, I recommend doing so before you continue reading. But if you don’t have time for that, don’t worry. Here’s a quick summary of the key ideas from the previous part that are required to understand this new post.

Matching in R (I): Subclassification, Common Support and the Curse of Dimensionality

Sun, 27 Feb 2022 00:00:00 +0000

Until now, the posts about causal inference on this blog have been centred around frameworks that enable the discussion of causal inference problems, such as Directed Acyclic Graphs (DAGs) and the Potential Outcomes model. Now it’s time to go one step further and start talking about the “toolbox” that allows us to address causal inference questions when working with observational data (that is, data where the treatment variable is not under the full control of the researcher).

Randomization Inference in R: a better way to compute p-values in randomized experiments

Tue, 18 Jan 2022 00:00:00 +0000

Welcome to a new post of the series about the book Causal Inference: The Mixtape. In the previous post, we saw an introduction to the potential outcomes notation and how this notation allows us to express key concepts in the causal inference field. One of those key concepts is that the simple difference in outcomes (SDO) is an unbiased estimator of the average treatment effect whenever the treatment has been randomly assigned (i.

Analyzing my music collection with Python and R

Thu, 30 Dec 2021 00:00:00 +0000

A couple of months ago, I decided that it was time for me to finally grow out of my R comfort zone and start studying Python. I began my Python journey by reading the book Python for Data Analysis by Wes McKinney (creator of pandas, the Python equivalent of the tidyverse), and having finished it, I wanted to put what I had learned into practice through an applied data analysis.

Potential Outcomes Model (or why correlation is not causality)

Thu, 09 Dec 2021 00:00:00 +0000

This article, the second one of the series about the book Causal Inference: The Mixtape, is all about the Potential Outcomes notation and how it enables us to tackle causality questions and understand key concepts in this field. The central idea of this notation is the comparison between two states of the world. The first is the actual state: the outcomes observed in the data given the real value taken by some treatment variable.

Introduction to causal diagrams (DAGs)

Sat, 31 Jul 2021 00:00:00 +0000

This article is the first in a series dedicated to the content of the book Causal Inference: The Mixtape, in which I will try to summarize the main topics and methodologies presented in the book. DAGs (Directed Acyclic Graphs) are a type of visualization that has multiple applications, one of which is the modeling of causal relationships. We can use DAGs to represent the causal relationships that we believe exist between the variables of interest.