Integrating PySpark with SQL Server using JDBC
In my series on connecting different sources to Spark I have explained how to connect to S3 and Redshift. To further extend those trials, here is a quick demo on how to connect to SQL Server using JDBC.
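The connection itself can be sketched as follows. This is a minimal sketch, not the exact demo code: the host, database, table and credentials are placeholders, and it assumes the Microsoft `mssql-jdbc` driver jar is available locally so Spark can load it.

```python
from pyspark.sql import SparkSession

# Hypothetical host, database and credentials -- replace with your own.
jdbc_url = "jdbc:sqlserver://myhost:1433;databaseName=mydb"

spark = (
    SparkSession.builder
    .appName("sqlserver-demo")
    # Assumes the Microsoft JDBC driver jar has been downloaded locally.
    .config("spark.jars", "/path/to/mssql-jdbc.jar")
    .getOrCreate()
)

# Read a table over JDBC into a Spark DataFrame.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
df.show()
```

From here the DataFrame behaves like any other Spark source, so it can be joined or written out as usual.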
Previously I was using powerlevel9k as the theme for my iTerm2 Zsh configuration. Recently I had to set up a new MacBook and found an easier way to make the terminal look fancier. powerlevel10k is the better version of powerlevel9k, especially since it has a configuration prompt where the installer guides you through all the changes you can make to the style.
On a Mac it is as simple as the following few lines, assuming you have Homebrew installed.
$ brew install romkatv/powerlevel10k/powerlevel10k
$ echo 'source /usr/local/opt/powerlevel10k/powerlevel10k.zsh-theme' >>! ~/.zshrc
$ p10k configure
Using pytest's parametrize, writing tests becomes significantly easier. Instead of writing a test for each combination of parameters I can write one test with a list of different sets of parameters. A short example:
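A minimal sketch of the idea, using a made-up `multiply` function; each tuple in the list becomes its own test case:

```python
import pytest


def multiply(a, b):
    return a * b


# One parametrized test replaces three near-identical test functions.
@pytest.mark.parametrize(
    "a, b, expected",
    [
        (2, 3, 6),
        (0, 5, 0),
        (-1, 4, -4),
    ],
)
def test_multiply(a, b, expected):
    assert multiply(a, b) == expected
```

Running `pytest` reports each parameter set as a separate test, so a failure points directly at the offending combination.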
In my article on how to connect to S3 from PySpark I showed how to set up Spark with the right libraries to be able to read from and write to AWS S3. In the following article I show a quick example of how I connect to Redshift and use the S3 setup to write the table to file.
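A sketch of that flow; the cluster endpoint, table and credentials are placeholders, and it assumes both the Redshift JDBC driver and the S3 setup from the earlier article are already in place:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-demo").getOrCreate()

# Hypothetical cluster endpoint and credentials -- replace with your own.
redshift_url = (
    "jdbc:redshift://my-cluster.abc123.eu-west-2.redshift.amazonaws.com"
    ":5439/mydb?user=my_user&password=my_password"
)

# Read the Redshift table over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", redshift_url)
    .option("dbtable", "public.my_table")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .load()
)

# Reuse the S3 setup from the earlier article to write the table out.
df.write.parquet("s3a://my-bucket/my_table/")
```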
In my post Using Spark to read from S3 I explained how I was able to connect Spark to AWS S3 on an Ubuntu machine. Last week I was trying to connect to S3 again using Spark on my local machine, but I wasn't able to read data from our datalake. Our datalake is hosted in the eu-west-2 region, which apparently requires you to specify the authentication signature version. Instead of setting up the right environment on my machine and reconfiguring everything, I chose to update the Docker image from my notebook repo so I could test on my Mac before pushing it to my server. Instead of configuring both my local and remote environments, I can simply spin up the Docker container and have two identical environments.
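For reference, the signature-version fix inside the container can be sketched as a couple of Spark configuration settings; the exact keys depend on your Hadoop and AWS SDK versions, so treat this as a starting point rather than a drop-in answer:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-eu-west-2")
    # eu-west-2 only accepts Signature Version 4, so point s3a at the
    # regional endpoint and enable V4 signing in the JVM.
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-2.amazonaws.com")
    .config("spark.driver.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.executor.extraJavaOptions",
            "-Dcom.amazonaws.services.s3.enableV4=true")
    .getOrCreate()
)
```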
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# Cast the count column to an integer; withColumn returns a new
# DataFrame, so the result has to be assigned back.
dataframe = dataframe.withColumn("count", F.col("count").cast(IntegerType()))
In this project I want to verify the availability of the APIs that we use to ingest data into our data platform. In the example I will use Jira, Workable and HiBob, since they offer clean APIs without too much configuration. First I will create a test suite to verify the availability, and once this works, move it to a Lambda function that CloudWatch can trigger on a fixed schedule.
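The core of such a check can be sketched with only the standard library. The endpoint URLs below are placeholders, not the real project configuration, and the HTTP call is injectable so the function can be exercised in tests without network access:

```python
import urllib.request


def is_available(url, opener=urllib.request.urlopen):
    """Return True when the endpoint answers with an HTTP 2xx status."""
    try:
        with opener(url, timeout=10) as response:
            return 200 <= response.status < 300
    except Exception:
        # Connection errors, timeouts and HTTP errors all count as "down".
        return False


# Placeholder endpoints for the APIs mentioned above.
ENDPOINTS = {
    "jira": "https://example.atlassian.net/rest/api/2/myself",
    "workable": "https://www.workable.com/spi/v3/accounts",
    "hibob": "https://api.hibob.com/v1/company/people",
}
```

Wrapping `is_available` in a loop over `ENDPOINTS` gives the test suite, and the same function can later be dropped into the Lambda handler unchanged.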
This project contains a simple example on how to build a resume with Python using Jinja, HTML, Bootstrap and a data file. In the past I have always created my resume with LaTeX, but to make life a little easier I chose to switch to a combination of Python and HTML. Maintaining a LaTeX document is cumbersome and it is difficult to separate the data from the style. By using Jinja it is straightforward to separate the resume data from the actual layout. And the most important part: I can stick to Python!
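A minimal sketch of that separation, assuming Jinja2 is installed; the data dict and the inline template stand in for the project's data file and HTML layout:

```python
from jinja2 import Template

# The resume data lives separately from the layout.
resume = {
    "name": "Jane Doe",
    "jobs": [
        {"title": "Data Engineer", "company": "Acme"},
        {"title": "Analyst", "company": "Initech"},
    ],
}

# The layout only describes how the data is presented.
template = Template(
    "<h1>{{ name }}</h1>\n"
    "<ul>\n"
    "{% for job in jobs %}  <li>{{ job.title }} at {{ job.company }}</li>\n"
    "{% endfor %}</ul>"
)

html = template.render(**resume)
print(html)
```

Updating the resume then only means editing the data file, while the HTML and Bootstrap styling stay untouched.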
A simple introduction to create fake data using the Faker tool in Python. Very convenient if you need to generate dummy data for an experiment.
In my two previous articles Unittesting in a Jupyter notebook and Mocking in unittests in Python I have discussed the use of unittest and mock to run tests for a simple Castle and Character class. For the code behind this article, please check GitHub.
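As a refresher, the pattern those articles use can be condensed as follows. The `Castle` class here is a simplified stand-in, not the exact code from the repository; the point is that `unittest.mock` lets the test supply a fake Character instead of constructing a real one:

```python
import unittest
from unittest.mock import MagicMock


class Castle:
    """Simplified stand-in for the Castle class from the articles."""

    def __init__(self, name):
        self.name = name

    def has_access(self, character):
        # Only characters holding the right powerup may enter.
        return character.powerup == "fire flower"


class TestCastle(unittest.TestCase):
    def test_has_access_with_mocked_character(self):
        castle = Castle("Bowser's Castle")
        # Mock a Character instead of building a real instance.
        character = MagicMock()
        character.powerup = "fire flower"
        self.assertTrue(castle.has_access(character))

    def test_no_access_without_powerup(self):
        castle = Castle("Bowser's Castle")
        character = MagicMock()
        character.powerup = "mushroom"
        self.assertFalse(castle.has_access(character))
```

Because the mock records every attribute access, the tests stay focused on `Castle` without depending on how `Character` is implemented.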