Add examples and explanations for benchmarking, slice comparison, and table-driven testing in Go.

This commit is contained in:
ErdemOzgen 2024-01-08 21:20:01 +03:00
parent ffd5016922
commit 1557e78bd5
10 changed files with 912 additions and 1 deletions

View File

@ -0,0 +1,13 @@
# Cloud Data Warehouses
Last updated Dec 8, 2023
- Snowflake
- Google BigQuery
- Amazon Redshift
- Azure SQL Data Warehouse
- Firebolt
Another approach is to migrate your on-premise Data Warehouse to a **Cloud Data Warehouse** to gain more scalability, more speed, and better availability. This option is best suited for you if you do not necessarily need the fastest response times and you do not have terabytes or petabytes of data. **The idea is to speed up your DWH and skip the layer of cubes. This way you save much time in the development, processing, and maintenance of [OLAP Cubes](https://www.ssp.sh/brain/olap).** On the other hand, you sacrifice some query latency while you create your dashboards. If you mainly have reports anyway, which can be run beforehand, then this is perfect for you.

View File

@ -0,0 +1,90 @@
# Data Engineering Architecture
A good data engineering architecture is hard.
The best overview is Emerging Architectures for Modern Data Infrastructure by a16z (or, in Zotero: The Guide to Modern Data Architecture | Future). Also check Perfect DWH architecture.
## Data lake vs Data Warehouse
The concepts of "data lake" and "data warehouse" are both related to big data and analytics, but they serve different purposes and have distinct characteristics. Let's explore each of them:
### Data Lake
1. **Definition**: A data lake is a vast pool of raw data, the purpose of which is not yet defined. It stores structured, semi-structured, and unstructured data. The data is kept in its native format until it is needed.
2. **Flexibility**: Since data lakes retain all data in its raw form, they offer high flexibility to analysts and data scientists to apply different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning.
3. **Users**: Data lakes are mainly used by Data Scientists and Data Engineers who need to perform exploratory data analysis, and who are skilled in using advanced analytical tools and techniques.
4. **Scalability and Cost**: Data lakes, especially when implemented in a cloud environment, can scale easily to store and process large amounts of data, and they can be more cost-effective in terms of storage.
5. **Data Quality and Governance**: One challenge with data lakes is managing the quality and governance of the data. Without proper management, data lakes can become unmanageable and turn into what is sometimes called a “data swamp”.
### Data Warehouse
1. **Definition**: A data warehouse is a system used for data analysis and reporting. It is a central repository of integrated data from one or more disparate sources. The data stored in a data warehouse is structured and processed.
2. **Purpose and Structure**: The data in a data warehouse is processed, structured, and used for specific purposes like reporting and analysis. It's organized in a way to quickly provide insights (like sales performance, operational efficiency, etc.).
3. **Users**: Data warehouses are typically used by business professionals like Business Analysts, who rely on data for making strategic decisions. They are not necessarily technical experts in data processing.
4. **Performance and Complexity**: Data warehouses are highly efficient at handling queries and are optimized for read-access and simplicity, making them suitable for less complex queries that are repeated frequently.
5. **Maintenance and Cost**: They generally require more maintenance, including data cleaning and data integration tasks. This can make them more expensive to operate compared to data lakes.
In summary, a data lake is a large-scale storage solution for raw, unstructured data, which offers flexibility for different types of data processing and analysis. A data warehouse, in contrast, is a structured repository of processed and refined data, designed for specific analytical tasks and queries. The choice between a data lake and a data warehouse depends on the specific needs, the nature of the data, and the intended use cases.
## Common Architectures
- Examples and Types of Data Architecture:
- [[Data Warehouse]]
- [[Cloud Data Warehouse]]
- Data Marts
- [[Data Lake]]
- [[Modern Data Stack]] vs [Open Data Stack](https://www.ssp.sh/brain/open-data-stack)
- Lambda Architecture vs Kappa Architecture, Medallion Architecture
- [MapReduce](https://www.ssp.sh/brain/mapreduce) vs [Hadoop](https://www.ssp.sh/brain/hadoop)
- Metrics Layer vs Semantic Warehouse vs Data Virtualization
- [Metrics](https://www.ssp.sh/brain/metrics), Key Performance Indicator (KPI)
- Push-Downs vs rollup
- Data Modeling vs [Dimensional Modeling](https://www.ssp.sh/brain/dimensional-modeling)
- Data Contracts
- Delta Lake architecture, which unifies batch processing and streaming
- Architecture for IoT
- [Data Mesh](https://www.ssp.sh/brain/data-mesh)
- Data architectures have countless other variations
- Data Fabric, data hub, scaled architecture, metadata-first architecture, event-driven architecture, live data stack, and many more (see Reis and Housley - Fundamentals of Data Engineering.pdf)
Many overlap, and I have listed more in [Data Modeling](https://www.ssp.sh/brain/data-modeling).
## Data Architectures Images
Updated:
![](https://www.ssp.sh/brain/Pasted%20image%2020220812142059.png)
by [Emerging Architectures for Modern Data Infrastructure | Andreessen Horowitz](https://a16z.com/emerging-architectures-for-modern-data-infrastructure/)
### People (on top of the illustration)
![](https://www.ssp.sh/brain/Pasted%20image%2020221222063638.png)
RW The Modern Data Graph - By Stephen Bailey
![](https://www.ssp.sh/brain/Pasted%20image%2020220405073024.png)
![](https://www.ssp.sh/brain/Pasted%20image%2020220503123606.png)
See a collection of many more on Data Engineering Architectures Overview or 4 Data Architectures.pdf, or Perfect DWH architecture.

View File

@ -0,0 +1,94 @@
# Data Engineering Lifecycle
Last updated Dec 6, 2023
Table of Contents
1. [Undercurrents](https://www.ssp.sh/brain/data-engineering-lifecycle/#undercurrents)
2. [My Fundamentals](https://www.ssp.sh/brain/data-engineering-lifecycle/#my-fundamentals)
    1. [Undercurrents](https://www.ssp.sh/brain/data-engineering-lifecycle/#undercurrents-1)
    2. [Core Principles and Links](https://www.ssp.sh/brain/data-engineering-lifecycle/#core-principles-and-links)
In today's dynamic environment, a data engineer is responsible for managing the entire data engineering process. This encompasses gathering data from diverse sources and preparing it for use in downstream applications. Mastery of the various stages of the data engineering lifecycle is crucial, along with a knack for assessing data tools to ensure they deliver on multiple fronts: cost-effectiveness, speed, flexibility, scalability, user-friendliness, reusability, and interoperability.
![](https://www.ssp.sh/brain/data-engineering-lifecycle.png)
The data engineering lifecycle, as depicted by [Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)
Alternatively, refer to this visualization in a [Tweet](https://twitter.com/mattarderne/status/1604528546784870402/photo/1):
![](https://www.ssp.sh/brain/data-engineering-data-flow-problems.png)
Further insights can be found in [Data Engineering Architecture](https://www.ssp.sh/brain/data-engineering-architecture) (e.g., the one from A16z).
> Example Open Data Stack Project
>
> In our [Open Data Stack](https://www.ssp.sh/brain/open-data-stack) project, we delve into the essential components of the lifecycle, such as ingestion, transformation, analytics, and machine learning.
Discover more at [The Evolution of The Data Engineer: A Look at The Past, Present & Future](https://airbyte.com/blog/data-engineering-past-present-and-future).
## Undercurrents
These are the core pillars of the lifecycle, omnipresent across its various stages: security, data management, DataOps, data architecture, orchestration, and software engineering.
The lifecycle's functionality hinges on these undercurrents.
## My Fundamentals
# Data Engineering Lifecycle
In today's landscape, a data engineer is pivotal in overseeing the entire data engineering process. This involves gathering data from diverse sources and ensuring its availability for downstream applications. A deep understanding of the various stages in the data engineering lifecycle is essential. Additionally, a data engineer must possess the skill to evaluate data tools effectively, considering various aspects such as cost, speed, flexibility, scalability, user-friendliness, reusability, and interoperability.
![](https://www.ssp.sh/brain/data-engineering-lifecycle.png)
Illustration of the data engineering lifecycle, from [Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)
Another perspective can be seen in this [Tweet](https://twitter.com/mattarderne/status/1604528546784870402/photo/1):
![](https://www.ssp.sh/brain/data-engineering-data-flow-problems.png)
For more insights, see [Data Engineering Architecture](https://www.ssp.sh/brain/data-engineering-architecture), such as the one from A16z.
> Case Study: Open Data Stack Project
>
> The Open Data Stack project exemplifies practical application, incorporating key lifecycle components like ingestion, transformation, analytics, and machine learning.
Further reading: [The Evolution of The Data Engineer: Past, Present & Future](https://airbyte.com/blog/data-engineering-past-present-and-future).
## Undercurrents
These are the foundational elements of the lifecycle, pervasive throughout its various stages: security, data management, DataOps, data architecture, orchestration, and software engineering. The lifecycle cannot function effectively without these integral undercurrents.
## Core Principles and Links
Here are the core principles of the engineering lifecycle from above, extended with my own thoughts and additions.
- Data Integration (Ingestion)
- Transformation
    - [Semantic Layer](https://www.ssp.sh/brain/semantic-layer) / [Metrics Layer](https://www.ssp.sh/brain/metrics-layer)
    - Physical transformation (e.g., [dbt](https://www.ssp.sh/brain/dbt))
- [Storage Layer](https://www.ssp.sh/brain/storage-layer)
- Analytics and Machine Learning
- Additional Elements:
    - [Data Catalog](https://www.ssp.sh/brain/data-catalog)
    - Reverse ETL
- General Foundations (Undercurrents):
    - Data Security
    - Data Management
    - [Data Modeling](https://www.ssp.sh/brain/data-modeling) (e.g., [Dimensional Modeling](https://www.ssp.sh/brain/dimensional-modeling))
    - Data Quality, Observability, Monitoring (Governance)
    - [Data Engineering Architecture](https://www.ssp.sh/brain/data-engineering-architecture)
    - [Orchestration](https://www.ssp.sh/brain/data-orchestrators)
    - Software Engineering

View File

@ -0,0 +1,44 @@
# Data Lake
Last updated Dec 8, 2023
Table of Contents
1. [Why Do You Need a Data Lake?](https://www.ssp.sh/brain/data-lake/#why-do-you-need-a-data-lake)
2. [Adding Database and ML features](https://www.ssp.sh/brain/data-lake/#adding-database-and-ml-features)
A Data Lake is a versatile storage system, found within the [Storage Layer](https://www.ssp.sh/brain/storage-layer), containing a vast array of both unstructured and structured data. This data is stored without a predetermined purpose, allowing for flexibility and scalability. Data Lakes can be built using a variety of technologies, including Hadoop, NoSQL, Amazon Simple Storage Service, and relational databases, and they accommodate diverse data formats such as Excel, CSV, Text, Logs, and more.
The concept of a data lake, as detailed in the [Hortonworks Data Lake Whitepaper](http://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf), emerged from the need to capture and leverage new types of enterprise data. Early adopters found that significant insights could be gleaned from applications specifically designed to utilize this data. Key capabilities of a data lake include:
- Capturing and storing raw data at scale affordably
- Housing various data types in a unified repository
- Allowing data transformations for undefined purposes
- Facilitating new data processing methods
- Supporting focused analytics for specific use cases
## Why Do You Need a Data Lake?
A data lake serves as a comprehensive storage solution, employing [Data Lake File Formats](https://www.ssp.sh/brain/data-lake-file-formats) and various [Data Lake Table Format](https://www.ssp.sh/brain/data-lake-table-format)s to manage extensive volumes of unstructured and semi-structured data. As a primary destination for a growing assortment of exploratory and operational data, it caters to a broad spectrum of users, ranging from technical experts to business analysts, for diverse analytical and machine learning purposes.
The data lake model circumvents the limitations of traditional BI tools' proprietary formats, offering direct data loading capabilities. This shift eliminates the time-consuming construction and maintenance of complex ETL pipelines and expedites data access, significantly reducing waiting times.
Early adopters of data lakes have demonstrated their efficacy in making data readily available and extractable for business insights. A data lake's architecture enables efficient data storage and versatile transformations, facilitating swift iteration and exploration of business value on an ad-hoc basis.
Data lakes, as initially proposed in the [2014 Data Lake paper](http://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf), can be constructed using various technologies and support multiple data formats, including Excel, CSV, Text, Logs, Apache Parquet, and [Apache Arrow](https://www.ssp.sh/brain/apache-arrow).
The foundation of every data lake is a basic storage provider like AWS S3 or Azure Blob, which is then enhanced with essential database-like features, further discussed in this article.
## Adding Database and ML features
If you want to reach the next level of the data lake, you can build a [Data Lakehouse](https://www.ssp.sh/brain/data-lakehouse), which mostly uses advanced features from the [Data Lake Table Format](https://www.ssp.sh/brain/data-lake-table-format)s. I also wrote a deep dive: [Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)](https://ssp.sh/blog/data-lake-lakehouse-guide/).

View File

@ -0,0 +1,156 @@
# Data Modeling
Last updated Dec 31, 2023
Table of Contents
1. [Different Levels](https://www.ssp.sh/brain/data-modeling/#different-levels)
    1. [Different Data Modeling Techniques](https://www.ssp.sh/brain/data-modeling/#different-data-modeling-techniques)
2. [(Design) Patterns](https://www.ssp.sh/brain/data-modeling/#design-patterns)
3. [Data Modeling is changing](https://www.ssp.sh/brain/data-modeling/#data-modeling-is-changing)
4. [Tools](https://www.ssp.sh/brain/data-modeling/#tools)
5. [Frameworks](https://www.ssp.sh/brain/data-modeling/#frameworks)
6. [Difference to Dimensional Modeling](https://www.ssp.sh/brain/data-modeling/#difference-to-dimensional-modeling)
7. [Data Modeling part of Data Engineering?](https://www.ssp.sh/brain/data-modeling/#data-modeling-part-of-data-engineering)
Data modeling has changed over time; when I started (~20 years ago), choosing between Inmon and Kimball was common.
Today, in the context of data engineering, data modeling creates a structured representation of your organization's data. Often illustrated visually, this representation helps you understand the relationships, constraints, and patterns within the data and serves as a blueprint for gaining business value in designing data systems, such as data warehouses, lakes, or any analytics solution.
In its most straightforward form, data modeling is how we design the flow of our data so that it moves efficiently and in a structured way, with good data quality and as little redundancy as possible.
---
Data Modeling is as much about [Data Engineering Architecture](https://www.ssp.sh/brain/data-engineering-architecture) as it is about modeling the data itself. Therefore, besides the links below, you can find many approaches and common architectures in [Data Engineering Architecture](https://www.ssp.sh/brain/data-engineering-architecture#common-architectures).
It's becoming more about language than about the modeling itself, as Shane Gibson says on [Making Data Modeling Accessible](https://open.spotify.com/episode/4DNyy4cIttEFMUEWjKEHqV?si=748743c87c2a4d0e). For example, a Data Scientist speaks of [Wide Tables](https://www.ssp.sh/brain/one-big-table), a Data Engineer talks about facts and dimensions, etc. It's what I call the different levels of data modeling in Data Modeling - The Unsung Hero of Data Engineering: An Introduction to Data Modeling (Part 1).
## Different Levels
How do you think about different levels of modeling? Generally, when I started (20 years ago), it was common to choose between Inmon and Kimball. But today, there are so many layers, levels, and approaches. Did you find a good way of separating or naming the different “levels” (still not sure about that term) to make it clear what is meant? Below is a list of what I have collected so far (I have also written about this extensively, in case of interest).
- **Levels of Modeling**
    - Generation or source database design
    - Data integration
    - ETL processes
    - Data warehouse schema creation
    - Data lake structuring
    - BI tool presentation layer design
    - Machine learning or AI feature engineering
- **Data Modeling Approaches**
    - Conceptual, Logical, and Physical Data Models
    - Other, lesser-known approaches: Hierarchical Data Modeling, Network Data Modeling, and Object-Role Modeling
- **Data Modeling Techniques**
    - [Dimensional Modeling](https://www.ssp.sh/brain/dimensional-modeling)
    - Data Vault Modeling
    - [Anchor Modeling](https://www.ssp.sh/brain/anchor-modeling)
    - [Bitemporal Modeling](https://www.ssp.sh/brain/bitemporal-modeling)
    - [Entity-Centric Data Modeling (ECM)](https://www.ssp.sh/brain/entity-centric-data-modeling-ecm)
    - [Focal Modeling](https://www.ssp.sh/brain/focal-modeling)
    - [Activity Schema](https://www.ssp.sh/brain/activity-schema)
- **Data Architecture Pattern**
    - General Purpose Data Architecture Pattern
        - Staging, Cleansing, Core, Data Mart ([Classical Architecture of Data Warehouse](https://www.ssp.sh/brain/classical-architecture-of-data-warehouse)) or Medallion Architecture
    - Specialized
        - Batch vs. Streaming (Streaming vs Batch in Orchestration)
        - Data Lake/Lakehouse vs. Data Warehouse Pattern
        - [Semantic Layer](https://www.ssp.sh/brain/semantic-layer) (In-memory vs. Persistence or Semantic vs. Transformation Layer)
        - [Modern Data Stack](https://www.ssp.sh/brain/modern-data-stack) / [Open Data Stack](https://www.ssp.sh/brain/open-data-stack) Pattern
    - many more: Data Modeling- The Unsung Hero of Data Engineering- Data Architecture Pattern, Tools and The Future (Part 3)
[LinkedIn Post and Discussion](https://www.linkedin.com/posts/sspaeti_datamodeling-dataarchitecture-dataengineering-activity-7075390406099652609-sUfh?utm_source=share&utm_medium=member_desktop) and [dbt Slack](https://getdbt.slack.com/archives/C0VLZPLAE/p1686903398031609). Links (from post): Data Model Matrix.
### Different Data Modeling Techniques
![](https://www.ssp.sh/brain/normalization%E2%80%94denormalization-illustration.png)
A nice illustration of how different modeling techniques work | Source: [GitHub - Data-Engineer-Camp/dbt-dimensional-modelling: Step-by-step tutorial on building a Kimball dimensional model with dbt](https://github.com/Data-Engineer-Camp/dbt-dimensional-modelling)
Other data modeling techniques (see [my Tweet](https://twitter.com/sspaeti/status/1707638116360589730)):
- Enterprise Data Warehouse (Inmon)
- Star Schema (Kimball)
- Data Vault
- One Big Table (OBT)
![](https://www.ssp.sh/brain/dwh-inmonn-vs-star-schema-vs-data-vault-vs-one-big-table.png)
Source: [Data Modeling in the Modern Data Stack | Towards Dev](https://towardsdev.com/data-modeling-in-the-modern-data-stack-d29be964b3a7)
## (Design) Patterns
Common approaches are well explained [here](https://youtu.be/IdCmMkQLvGA?t=153):
- [Dimensional Modeling](https://www.ssp.sh/brain/dimensional-modeling)
- [Data Lake File Format](https://www.ssp.sh/brain/data-lake-file-formats) -> [Data Lake Table Format](https://www.ssp.sh/brain/data-lake-table-format)
- [Relational Model](https://www.ssp.sh/brain/relational-model)
- Graph Data Modeling?
Others:
- streaming vs batch processing
- RW Data Pipeline Design Patterns - 1. Data Flow Patterns · Start Data Engineering (Nice Visual)
## Data Modeling is changing
See [Data Modeling is changing](https://www.ssp.sh/brain/data-modeling-is-changing).
## Tools
See Data Modeling Tools or Data Modeling- The Unsung Hero of Data Engineering- Data Architecture Pattern, Tools and The Future (Part 3).
## Frameworks
- BEAM from Agile Data Warehouse Design (Lawrence Corr, Jim Stagnitto)
- ADAPT for [OLAP](https://www.ssp.sh/brain/olap) cubes
- …
## Difference to [Dimensional Modeling](https://www.ssp.sh/brain/dimensional-modeling)
See: [Data Modeling - The Unsung Hero of Data Engineering: Modeling Techniques and Data Architecture Patterns (Part 2), section "Data Modeling vs. Dimensional Modeling"](https://www.ssp.sh/brain/Data%20Modeling%20%E2%80%93%20The%20Unsung%20Hero%20of%20Data%20Engineering-%20Modeling%20Techniques%20and%20Data%20Architecture%20Patterns%20(Part%202)#Data%20Modeling%20vs.%20Dimensional%20Modeling)
There is more than dimensional modeling:
- hierarchies, semistructured sources, conformed dimensions, historical updates, and the logic used to keep them up to date
- Source: [Serge Gershkovich on LinkedIn](https://www.linkedin.com/feed/update/urn:li:activity:6993236783610114048?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A6993236783610114048%2C6993250097132130304%29&replyUrn=urn%3Ali%3Acomment%3A%28activity%3A6993236783610114048%2C6993251854910406657%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%286993250097132130304%2Curn%3Ali%3Aactivity%3A6993236783610114048%29&dashReplyUrn=urn%3Ali%3Afsd_comment%3A%286993251854910406657%2Curn%3Ali%3Aactivity%3A6993236783610114048%29)
## Data Modeling part of Data Engineering?
Data modeling, especially [Dimensional Modeling](https://www.ssp.sh/brain/dimensional-modeling) with its facts and dimensions, is a big thing for a data engineer, IMO. You need to ask vital questions to optimize for data consumers: Do you want to drill down into the different products? Is daily or monthly granularity enough? Keywords: granularity and rollup.
It also lets you think about Big-O implications regarding how often you touch and transfer data. I'd recommend the classic [Data Warehouse Toolkit](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/) by Ralph Kimball, which introduced many of these concepts and is still applicable today. Mostly it's not done in the beginning, but as soon as you grow bigger, you wish you had done more of it. :)
Links:
- [Honest No-BS Data Modeling w/ Juan Sequeda](https://www.linkedin.com/video/event/urn:li:ugcPost:6994285263711596545/)
- [What Is Data Modeling? - DATAVERSITY](https://www.dataversity.net/what-is-data-modeling/)
- [Modern Data Modeling Beyond The Theory - With Veronika Durgin - YouTube](https://youtu.be/3P3wMCYTQJc)

View File

@ -0,0 +1,2 @@
https://quii.gitbook.io/learn-go-with-tests/go-fundamentals/concurrency

View File

@ -0,0 +1,89 @@
Dependency Injection (DI) in Go (Golang) follows the same basic principles as in other programming languages, but with a focus on simplicity and explicitness, given Go's minimalist design. DI is used to decouple the creation of an object's dependencies from its behavior, making the code more modular, testable, and maintainable.
### Step-by-Step Explanation of Dependency Injection in Go
1. **Define Interfaces for Dependencies:**
- Identify the components of your application that can be abstracted behind interfaces. These interfaces represent the contracts that your components must fulfill.
- Example: If a component needs to access a database, define an interface for database operations.
2. **Implement the Dependencies:**
- Create concrete implementations of these interfaces. These implementations are the actual dependencies that will be injected.
- Example: Implement the database interface with a MySQL or PostgreSQL driver.
3. **Design Your Components to Receive Dependencies:**
- Write your components (such as structs and their methods in Go) to receive dependencies through their constructors or setter methods. This is where the actual injection happens.
- Example: A service struct takes a database interface implementation as a parameter in its constructor.
4. **Create a Centralized Place for Dependency Construction (Optional):**
- While not a requirement, some applications benefit from having a central place where dependencies are constructed. This can be a main function or a special factory function.
- This step often involves "wiring up" your application by creating the concrete dependencies and passing them to the components that need them.
5. **Inject the Dependencies:**
- Instantiate your dependencies and inject them into the components that require them.
- Example: In your `main` function, create a database connection and pass it to the service constructor.
6. **Use the Components:**
- With dependencies injected, your components are ready to be used. The key point here is that the components are not aware of how their dependencies are created, making them easier to test and maintain.
### Example in Go
Here's a simple example to illustrate DI in Go:
```go
package main

import "fmt"

// Database is an interface for database operations
type Database interface {
    Query(query string) string
}

// MySQL implements Database interface
type MySQL struct{}

func (db MySQL) Query(query string) string {
    // Implementation for MySQL query
    return "MySQL result for " + query
}

// Service uses a Database dependency
type Service struct {
    db Database
}

func NewService(db Database) *Service {
    return &Service{db: db}
}

func (s *Service) PerformAction(query string) string {
    // Use the database dependency
    return s.db.Query(query)
}

func main() {
    // Dependency is created and injected here
    mysql := MySQL{}
    service := NewService(mysql)

    // Use the service
    result := service.PerformAction("SELECT * FROM users")
    fmt.Println(result)
}
```
In this example:
- `Database` is an interface that abstracts database operations.
- `MySQL` is a concrete implementation of the `Database` interface.
- `Service` is a struct that depends on the `Database` interface. It receives its dependency through its constructor `NewService`.
- In the `main` function, we create a `MySQL` instance and inject it into a new `Service` instance.
This structure makes it easy to swap out the `MySQL` implementation with another implementation of the `Database` interface, such as a mock database for testing, without changing the `Service` code.
[[Mocking]]
https://quii.gitbook.io/learn-go-with-tests/go-fundamentals/dependency-injection

184
content/Mocking.md Normal file
View File

@ -0,0 +1,184 @@
Mocking in Go (Golang) is a technique used in testing to replace real objects with controlled replacements. These replacements are known as "mocks," which simulate the behavior of real objects. Mocking is particularly useful for isolating code from external dependencies like databases, APIs, or complex logic, making it easier to test individual components (units) of the software.
### Step-by-Step Explanation of Mocking in Go
1. **Identify Dependencies for Mocking:**
- Find the external dependencies in the component you want to test. These could be interfaces that your component interacts with, like a database access layer or a third-party service.
2. **Define Interfaces:**
- If not already done, define Go interfaces for these dependencies. Your component should interact with these interfaces rather than concrete implementations. This is crucial for mocking.
3. **Create Mock Objects:**
- Implement mock versions of these interfaces. These mock objects will mimic the behavior of real objects but in a controlled way, suitable for testing.
- You can write these manually or use a mocking framework like `gomock` or `moq` to generate them.
4. **Inject Mocks into Your Component:**
- Instead of using real objects, inject the mock objects into your component during testing. This is typically done in your test setup.
5. **Define Expected Behavior and Responses:**
- Configure your mocks to return specific responses or behave in certain ways when their methods are called. This often involves setting up expectations, return values, and possibly tracking how many times a method is called.
6. **Write Your Tests:**
- Write tests for your component as you normally would. The difference is that when your component interacts with its dependencies, it's actually using the mock objects.
7. **Assert and Verify:**
- After running your tests, assert the outputs of your component. Additionally, verify that the interactions with the mock objects happened as expected.
### Example in Go
Let's assume you have a service that interacts with a database, and you want to test the service without relying on a real database.
First, define an interface for your database:
```go
type Database interface {
    GetUser(id string) (*User, error)
}
```
Then, implement the service:
```go
type UserService struct {
    db Database
}

func (s *UserService) GetUser(id string) (*User, error) {
    return s.db.GetUser(id)
}
```
Now, create a mock for the `Database` interface. Here's a simple manual mock:
```go
type MockDatabase struct {
    GetUserFunc func(string) (*User, error)
}

func (m *MockDatabase) GetUser(id string) (*User, error) {
    return m.GetUserFunc(id)
}
```
Write a test using the mock:
```go
func TestUserService_GetUser(t *testing.T) {
    // Create a mock instance
    mockDB := &MockDatabase{
        GetUserFunc: func(id string) (*User, error) {
            return &User{ID: id, Name: "MockUser"}, nil
        },
    }

    // Inject the mock into your service
    userService := &UserService{db: mockDB}

    // Call the method you want to test
    user, err := userService.GetUser("123")
    if err != nil {
        t.Fatalf("unexpected error: %v", err)
    }

    // Assertions
    if user.ID != "123" {
        t.Errorf("expected user ID 123, got %s", user.ID)
    }
    if user.Name != "MockUser" {
        t.Errorf("expected user name MockUser, got %s", user.Name)
    }
}
```
In this test:
- `MockDatabase` is a mock implementation of the `Database` interface.
- `GetUserFunc` is a field in the mock struct that you can customize per test.
- You inject `mockDB` into `UserService` and then call the method you want to test.
- Finally, you assert that `UserService.GetUser` behaves as expected when `Database.GetUser` returns specific values.
Mocking with dependency injection in Go involves a few key steps: defining interfaces for your dependencies, implementing mocks for these interfaces, and then injecting these mocks into the components you are testing. Let's go through this process step by step with an example.
### Step 1: Define Interfaces
First, define interfaces for the dependencies in your application. This allows you to abstract away concrete implementations and makes it easier to swap them out for mocks during testing.
```go
// Database is an interface for interacting with a database
type Database interface {
    FetchData(query string) (string, error)
}
```
### Step 2: Implement Your Component
Implement your component to depend on these interfaces. This is where dependency injection comes into play. Your component will receive its dependencies (in this case, the `Database` interface) typically through constructor injection.
```go
type DataService struct {
    db Database
}

func NewDataService(db Database) *DataService {
    return &DataService{db: db}
}

func (s *DataService) GetData(query string) (string, error) {
    return s.db.FetchData(query)
}
```
### Step 3: Create Mock Implementations
Create mock implementations of your interfaces. These mocks will simulate the behavior of real dependencies in a controlled way for testing purposes. You can manually write these mocks or use a mocking library like `gomock`.
```go
// MockDatabase is a mock implementation of the Database interface
type MockDatabase struct {
    FetchDataFunc func(string) (string, error)
}

func (m *MockDatabase) FetchData(query string) (string, error) {
    return m.FetchDataFunc(query)
}
```
### Step 4: Write Tests with Mocks
When writing tests for your component, create instances of your mocks and configure their behavior. Then, inject these mocks into the component you are testing.
```go
func TestDataService_GetData(t *testing.T) {
    // Set up the mock
    mockDB := &MockDatabase{
        FetchDataFunc: func(query string) (string, error) {
            return "mock data", nil
        },
    }

    // Inject the mock into your service
    dataService := NewDataService(mockDB)

    // Test the GetData method
    data, err := dataService.GetData("SELECT * FROM table")
    if err != nil {
        t.Fatalf("unexpected error: %v", err)
    }

    // Assertions
    if data != "mock data" {
        t.Errorf("expected 'mock data', got '%s'", data)
    }
}
```
### Summary
In this process, you're taking advantage of dependency injection to easily swap real dependencies with mocks during testing. This approach enhances the testability of your code, allowing you to test components in isolation without relying on external systems like databases. It's a powerful technique in Go, particularly because Go's interface system makes it straightforward to create and use mocks.
https://quii.gitbook.io/learn-go-with-tests/go-fundamentals/mocking

View File

@ -0,0 +1,82 @@
# Modern Data Stack
Last updated Dec 7, 2023
- [Dataquestion](https://www.ssp.sh/brain/tags/dataquestion/)
Table of Contents
1. [Why the Modern Data Stack?](https://www.ssp.sh/brain/modern-data-stack/#why-the-modern-data-stack)
2. [Integrating with Dagster](https://www.ssp.sh/brain/modern-data-stack/#integrating-with-dagster)
3. [A comment I made on Social](https://www.ssp.sh/brain/modern-data-stack/#a-comment-i-made-on-social)
4. [Further Links](https://www.ssp.sh/brain/modern-data-stack/#further-links)
5. [Counter Arguments against Modern Data Stack](https://www.ssp.sh/brain/modern-data-stack/#counter-arguments-against-modern-data-stack)
The Modern Data Stack (MDS) comprises a suite of open-source tools designed for end-to-end analytics. This includes data ingestion, transformation, machine learning, and integration into a columnar data warehouse or lake solution, all complemented by an analytics BI dashboard backend. The stack's versatility allows extensions for data quality, data cataloging, and more.
MDS aims to enable data insights using the best-suited tools for each process. It's worth noting that “Modern Data Stack” is a relatively new term, with its definition still evolving.
> Synonym Names
> A burgeoning term, [ngods (new generation open-source data stack)](https://www.ssp.sh/brain/ngods-new-generation-open-source-data-stack), has emerged. Previously, I've referred to this concept as the Open Data Stack Project. Additionally, Dagster introduced the term DataStack 2.0 in a [recent blog post](https://dagster.io/blog/evolution-iq-case-study). [Open Data Stack](https://www.ssp.sh/brain/open-data-stack) is my own definition of it.
> Closed Source vs Open Source
> Closed Source examples: dbt, Looker, Snowflake, Fivetran, Hightouch, Census
> Open Source alternatives: airbyte, dbt, dagster, Superset, Reverse-ETL?
> Modern Data Stack on a Laptop
> [DuckDB: Modern Data Stack in a Box](https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html)
## Why the Modern Data Stack?
A perspective from [Reddit](https://www.reddit.com/r/dataengineering/comments/12acdrk/comment/jes4pr8/?utm_source=share&utm_medium=web2x&context=3) highlights the shift in data warehousing and analytics. It underscores the reduced need for extensive teams and infrastructure, thanks to new tools that streamline data management and reporting. Particularly for small and mid-sized companies, MDS offers a competitive edge in data handling, allowing even a single data engineer to manage vast datasets efficiently.
---
A notable article discussing Lakehouse, Metrics Layer, and Clickhouse:
[The Next Cloud Data Platform | Greylock](https://greylock.com/greymatter/the-next-cloud-data-platform/)
![](https://www.ssp.sh/brain/Pasted%20image%2020220527120559.png)
## Integrating with Dagster
The downside of MDS is the unbundling (see Bundling vs Unbundling - Monolith Data vs Microservices), but Dagster helps integrate the full data stack:
![](https://www.ssp.sh/brain/Pasted%20image%2020220428103513.png)
![](https://www.ssp.sh/brain/Pasted%20image%2020220428103632.png)
Dagster elevates the Modern Data Stack:
![](https://www.ssp.sh/brain/Pasted%20image%2020220428103934.png)
![](https://www.ssp.sh/brain/Pasted%20image%2020220428103939.png)
Explore more about its power with Dagster and [Data Assets](https://www.ssp.sh/brain/data-assets).
## A comment I made on Social
I often ponder the ideal tools for a data stack. My preference leans toward a [Cloud Data Warehouse](https://www.sspaeti.com/blog/olap-whats-coming-next/#cloud-data-warehouses) such as [Firebolt](https://www.firebolt.io/), [Snowflake](https://www.snowflake.com/), [BigQuery](https://cloud.google.com/bigquery/), [Redshift](https://aws.amazon.com/redshift/), or [Synapse](https://azure.microsoft.com/en-us/services/synapse-analytics/), as a starting point.
The journey typically begins with [Airbyte](https://airbyte.com/) for data integration, followed by SQL-based transformation with [dbt](https://www.getdbt.com/). Orchestrating the processes in [Python](https://www.sspaeti.com/blog/business-intelligence-meets-data-engineering/#8220use-python-and-sql-if-possible8221) with tools like [dagster](https://ask.astorik.com/c/resources-sspaeti/what-are-common-alternatives-to-apache-airflow) is crucial.
From there, I would integrate additional open-source tools based on specific needs: [Spark](https://spark.apache.org/) for processing, [Delta Lake](https://delta.io/) for data lake formatting and ACID transactions, [Amundsen](http://amundsen.io/) for data cataloging, and [Great Expectations](https://greatexpectations.io/) for data quality, among others. For smaller projects, [DuckDB](https://duckdb.org/) is suitable for local [OLAP](https://www.ssp.sh/brain/olap) scenarios, while [Kubernetes](https://kubernetes.io/) and DevOps provide scalability.
For teams without data engineering resources, closed-source options like [Ascend](https://www.ascend.io/) or [Foundry](https://www.palantir.com/platforms/foundry/) are viable alternatives.
Feel free to reach out for further discussion or clarifications.
## Further Links
- [The next layer of the modern data stack](https://www.getdbt.com/blog/next-layer-of-the-modern-data-stack/)
- [What Is the Modern Data Stack? | Fivetran](https://www.fivetran.com/blog/what-is-the-modern-data-stack)
## Counter Arguments against Modern Data Stack

View File

@ -64,4 +64,161 @@ For helper functions, it's a good idea to accept a `testing.TB` which is an inte
[[Property Based Testing]]
[[examples in golang]]
[[examples in golang]] ==> can help you with testing
# Benchmarking
https://quii.gitbook.io/learn-go-with-tests/go-fundamentals/iteration#benchmarking
Run benchmarks with `go test -bench=.`
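The linked chapter benchmarks a string-repeat helper; here is a minimal, self-contained sketch of the same idea (the `Repeat` function and the `iteration` package name are illustrative, not copied verbatim from the chapter). A benchmark is a function whose name starts with `Benchmark`, lives in a `_test.go` file, and takes a `*testing.B`; the framework runs the loop body `b.N` times and chooses `N` itself so the timing is statistically stable.
```go
package iteration

import (
	"strings"
	"testing"
)

// Repeat returns character repeated count times (illustrative helper under benchmark).
func Repeat(character string, count int) string {
	var repeated strings.Builder
	for i := 0; i < count; i++ {
		repeated.WriteString(character)
	}
	return repeated.String()
}

// BenchmarkRepeat is picked up by `go test -bench=.`; the framework runs the
// loop b.N times and reports the average time per operation (ns/op).
func BenchmarkRepeat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		Repeat("a", 5)
	}
}
```
Running `go test -bench=.` prints the benchmark name, the number of iterations, and the ns/op figure; the exact numbers depend on your machine.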
Go's built-in testing toolkit features a [coverage tool](https://blog.golang.org/cover). Whilst striving for 100% coverage should not be your end goal, the coverage tool can help identify areas of your code not covered by tests. If you have been strict with TDD, it's quite likely you'll have close to 100% coverage anyway.
Try running `go test -cover`. You should see:

    PASS
    coverage: 100.0% of statements
Go does not let you use equality operators with slices. You _could_ write a function to iterate over each `got` and `want` slice and check their values, but for convenience's sake, we can use [`reflect.DeepEqual`](https://golang.org/pkg/reflect/#DeepEqual), which is useful for seeing if _any_ two variables are the same.
In Go, slices cannot be compared directly using the equality operator (`==`). This is because slices are reference types in Go, and comparing them directly would compare the references (addresses in memory), not the content of the slices. To compare the contents of two slices, you need to iterate over the elements and compare them individually, or use a convenience function like `reflect.DeepEqual` from the `reflect` package.
### Using `reflect.DeepEqual`
`reflect.DeepEqual` is a function that checks if two variables are deeply equal. It's a part of the `reflect` package in Go, which provides run-time reflection, allowing you to inspect and manipulate objects at run time. `reflect.DeepEqual` is useful for comparing complex types like slices, maps, structs, etc., where a simple `==` comparison is not possible or sufficient.
Here's an example of how you can use `reflect.DeepEqual` to compare two slices:
```go
package main

import (
    "fmt"
    "reflect"
)

func main() {
    slice1 := []int{1, 2, 3}
    slice2 := []int{1, 2, 3}
    slice3 := []int{4, 5, 6}

    fmt.Println("slice1 == slice2:", reflect.DeepEqual(slice1, slice2)) // true
    fmt.Println("slice1 == slice3:", reflect.DeepEqual(slice1, slice3)) // false
}
```
In this example, `reflect.DeepEqual` is used to compare `slice1` with `slice2` and `slice3`. It returns `true` when comparing `slice1` and `slice2` because their contents are identical, and `false` when comparing `slice1` and `slice3` as their contents differ.
### Writing Your Own Comparison Function
If you prefer not to use `reflect.DeepEqual` (for example, for performance reasons, as `reflect.DeepEqual` can be slower and more memory-intensive), you can write your own function to compare slices:
```go
package main

import "fmt"

func slicesEqual(a, b []int) bool {
    if len(a) != len(b) {
        return false
    }
    for i := range a {
        if a[i] != b[i] {
            return false
        }
    }
    return true
}

func main() {
    slice1 := []int{1, 2, 3}
    slice2 := []int{1, 2, 3}
    slice3 := []int{4, 5, 6}

    fmt.Println("slice1 == slice2:", slicesEqual(slice1, slice2)) // true
    fmt.Println("slice1 == slice3:", slicesEqual(slice1, slice3)) // false
}
```
This `slicesEqual` function first checks if the slices have the same length. If not, it returns `false`. Then it iterates over the slices, comparing each element. If any elements are different, it returns `false`. If all elements are the same, it returns `true`.
Using `reflect.DeepEqual` is more convenient and less error-prone, especially for complex types or deeply nested data structures. However, for simple cases or when performance is a concern, a custom comparison function may be more efficient.
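As an addition (not from the linked guide): since Go 1.21, the standard library also ships a generic `slices.Equal`, which compares two slices element by element without reflection. A minimal sketch, assuming Go 1.21 or newer:
```go
package main

import (
	"fmt"
	"slices" // part of the standard library since Go 1.21
)

func main() {
	slice1 := []int{1, 2, 3}
	slice2 := []int{1, 2, 3}
	slice3 := []int{4, 5, 6}

	// slices.Equal works for any slice whose element type is comparable and
	// reports whether both slices have the same length and the same elements.
	fmt.Println("slice1 == slice2:", slices.Equal(slice1, slice2)) // true
	fmt.Println("slice1 == slice3:", slices.Equal(slice1, slice3)) // false
}
```
This gives you the convenience of `reflect.DeepEqual` for flat slices while keeping the type safety and performance of the hand-written loop.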
## Table-Driven Testing in Go
Table-driven testing is a popular testing pattern in Go (Golang), particularly well-suited for scenarios where you want to run the same test logic across different inputs and expected outputs. This approach is efficient, readable, and makes it easy to add new test cases.
### Concept
In table-driven testing, you define a table (slice) of test cases. Each test case in the table is a struct that includes the input data for the test and the expected result. You then iterate over this slice, running the same test logic for each case.
This approach is especially useful for:
- Reducing code duplication.
- Making it easier to add new test cases.
- Improving test readability and maintenance.
### Example
Let's consider an example where we want to test a function `Add` that adds two integers.
```go
package main

import "testing"

// Add returns the sum of two integers
func Add(a, b int) int {
    return a + b
}

// TestAdd is a table-driven test for the Add function
func TestAdd(t *testing.T) {
    tests := []struct {
        name string
        a, b int
        want int
    }{
        {"add two positive numbers", 2, 3, 5},
        {"add positive and negative", 1, -1, 0},
        {"add two negative numbers", -1, -2, -3},
        // add more test cases here
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got := Add(tt.a, tt.b)
            if got != tt.want {
                t.Errorf("Add(%d, %d) = %d; want %d", tt.a, tt.b, got, tt.want)
            }
        })
    }
}
```
### Explanation
- **Test Table**: The `tests` slice contains multiple test cases. Each case is defined by a struct with fields for inputs (`a` and `b`), the expected output (`want`), and a name (`name`) for the test case.
- **Iteration**: The `for _, tt := range tests` loop iterates over each test case.
- **Running the Test**: Inside the loop, `t.Run` is used to execute a subtest for each case. This allows each test case to be run independently, and it provides clearer test output, showing which cases pass or fail.
- **Assertions**: The actual function call (`Add(tt.a, tt.b)`) and the assertion (`if got != tt.want`) are inside the loop. If the function's output doesn't match the expected output, an error is reported with `t.Errorf`.
By using table-driven tests, you can easily see the different scenarios being tested and add new test cases by simply adding new entries to the `tests` slice. This pattern makes your tests more organized and easier to extend and maintain.
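A small addition (not from the linked guide): because each case runs via `t.Run`, you can execute a single case from the command line by matching its subtest name with `-run`; Go replaces the spaces in the name with underscores, e.g. `go test -run 'TestAdd/add_two_positive_numbers' -v`.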
[[Dependency Injection]]
[[Mocking]]
[[Concurrency Testing in Golang]]
https://quii.gitbook.io/learn-go-with-tests/go-fundamentals/dependency-injection
https://quii.gitbook.io/learn-go-with-tests/go-fundamentals/mocking