User Details
- User Since
- Jun 21 2021, 2:34 PM (158 w, 3 d)
- Availability
- Available
- LDAP User
- TChin
- MediaWiki User
- TChin (WMF) [ Global Accounts ]
Fri, Jun 14
I don't think so. The image suggestion work on Flink never progressed past the original ticket.
Tue, Jun 11
Getting rid of service-scaffold-node also means we should get rid of servicelib-node, since it was created for service-scaffold-node (this is why service-scaffold-node depends on packages that don't exist; the project was never finished).
Jun 3 2024
May 29 2024
Don't forget that any CI that has a production deployment pipeline needs the repo to be added to trusted runners and also have their tags protected (Slack thread on protecting tags)
This might be harder than I thought. Creating a dummy Google account to act as the receiver seems off the table. All of Google's APIs require OAuth or some manual way for the user to sign in. There is no way to make a pure bot account, and also no good way to automate login without being slapped by a ban.
May 28 2024
May 21 2024
Sounds good to me. service-scaffold-node was started to turn service-template-node into a group of libraries, and it's basically superseded by my effort to replace service-runner (T360924), which is mostly complete.
May 15 2024
Would be nice to get a confirmation for archiving node-rdkafka-statsd since it'll progress T349118
May 13 2024
Apr 17 2024
Has this project been discussed across the WMF/Community?
It would be great if there were an RFC process, but there have at least been discussions about what to do with service-runner, and this project is on the radar of the entirety of Data Platform Engineering, some people on the MW engineering team, and the language team. It was also posted on Slack in #engineering-all to give people a heads-up, just in case another team was working on something similar. If there's one thing I'm sure of, it's that the consensus is we need a replacement, whether or not this is it.
Apr 5 2024
The config store repo runs CI checks for JSON Schema correctness and validates config values against its JSON Schema. The Datasets Config service repo has dockerized CI using Kokkuri and Blubber.
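To make the idea concrete, here's a minimal stdlib-only sketch of the kind of "validate config values against a schema" check such CI might run. The schema and config fields here are hypothetical, and real CI would use a full JSON Schema validator rather than this toy type check.

```python
import json

# Hypothetical schema fragment, for illustration only.
schema = {
    "type": "object",
    "required": ["stream_name", "retention_days"],
    "properties": {
        "stream_name": {"type": "string"},
        "retention_days": {"type": "integer"},
    },
}

TYPE_MAP = {"string": str, "integer": int, "object": dict}

def check_config(config: dict, schema: dict) -> list:
    """Return a list of validation errors (empty means valid)."""
    errors = []
    for key in schema.get("required", []):
        if key not in config:
            errors.append(f"missing required key: {key}")
    for key, sub in schema.get("properties", {}).items():
        if key in config and not isinstance(config[key], TYPE_MAP[sub["type"]]):
            errors.append(f"{key}: expected {sub['type']}")
    return errors

# A config file with a type error: retention_days is a string, not an integer.
config = json.loads('{"stream_name": "aqs_hourly", "retention_days": "90"}')
print(check_config(config, schema))  # ['retention_days: expected integer']
```

In CI, a non-empty error list would fail the build before the config ever reaches the service.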
Mar 26 2024
Feb 27 2024
If it's to a point where we even need to use a new name, might as well break everything. I'd love to join in on the fun
Feb 11 2024
Looking at the logs, this seems to coincide with the redaction patch to eventstreams, but looking at the code I'm having a hard time finding where a memory leak could've been introduced... What's more confusing is that it's just 1 or 2 pods hitting the limit.
Jan 30 2024
Jan 22 2024
Using lz4 compression works, but checking it with parquet-tools doesn't: I see something like "compression: UNKNOWN (space_saved: -25%)". Seems to be a known issue.
Jan 5 2024
INSERT OVERWRITE with PARTITION also doesn't work anymore because Iceberg uses hidden partitioning, so I had to enable Spark's dynamic overwrite mode.
https://iceberg.apache.org/docs/latest/spark-writes/#insert-overwrite
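For reference, the dynamic overwrite mode mentioned above is controlled by a standard Spark setting (shown here as the bare conf key; where exactly it was set for this job isn't shown in these notes):

```
spark.sql.sources.partitionOverwriteMode=dynamic
```

With this set, INSERT OVERWRITE replaces only the partitions that the incoming data touches, rather than requiring an explicit PARTITION clause.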
TIL when setting the compression codec to snappy, Iceberg doesn't end the files in hdfs with .snappy.parquet. I had to check if the format was correct using parquet-tools.
Dec 19 2023
Tested whether the COALESCE hints still work in Iceberg by creating 2 tables and filling them with/without the hint. It still seems to work.
Dec 18 2023
Dec 16 2023
Tested on a stat machine with
CREATE EXTERNAL TABLE IF NOT EXISTS `aqs_hourly` (
  `cache_status`  string    COMMENT 'Cache status',
  `http_status`   string    COMMENT 'HTTP status of response',
  `http_method`   string    COMMENT 'HTTP method of request',
  `response_size` bigint    COMMENT 'Response size',
  `uri_host`      string    COMMENT 'Host of request',
  `uri_path`      string    COMMENT 'Path of request',
  `request_count` bigint    COMMENT 'Number of requests',
  `hour`          timestamp COMMENT 'The aggregated hour. Covers from minute 00 to 59'
)
USING ICEBERG
PARTITIONED BY (days(hour));
And
spark3-sql --master yarn \
  --executor-memory 8G --executor-cores 4 --driver-memory 2G \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  -f aqs_hourly_iceberg.hql \
  -d source_table=wmf.webrequest \
  -d webrequest_source=text \
  -d destination_table=tchin.aqs_hourly \
  -d coalesce_partitions=1 \
  -d year=2023 \
  -d month=12 \
  -d day=3 \
  -d hour=0
Dec 14 2023
Dec 11 2023
Dec 2 2023
Nov 14 2023
I think the per-image quota should probably be increased. I tested building a few projects locally and a project with NodeJS and 0 dependencies results in a built image that's 805.58 MB. One with only VueJS as a dependency bumps it up to 858.13 MB. I'm probably not going to be the last one who needs more than 200 MB of working space :/
Nov 13 2023
Example error:
step-export: 2023-11-13T05:41:56.835942824Z ERROR: failed to export: failed to write image to the following tags: [tools-harbor.wmcloud.org/tool-dpe-alerts-dashboard/tool-dpe-alerts-dashboard:latest: PATCH https://tools-harbor.wmcloud.org/v2/tool-dpe-alerts-dashboard/tool-dpe-alerts-dashboard/blobs/uploads/b62dd944-4fad-4ee8-b900-8409f7860d6c?_state=REDACTED: unexpected status code 413 Request Entity Too Large: <html>
step-export: 2023-11-13T05:41:56.835973012Z <head><title>413 Request Entity Too Large</title></head>
step-export: 2023-11-13T05:41:56.835976984Z <body>
step-export: 2023-11-13T05:41:56.835979969Z <center><h1>413 Request Entity Too Large</h1></center>
step-export: 2023-11-13T05:41:56.835983468Z <hr><center>nginx/1.18.0</center>
step-export: 2023-11-13T05:41:56.836002364Z </body>
step-export: 2023-11-13T05:41:56.836005027Z </html>
step-export: 2023-11-13T05:41:56.836008032Z ]
step-export:
step-results: 2023-11-13T05:41:57.433667715Z 2023/11/13 05:41:57 Skipping step because a previous step failed
Oct 26 2023
Oct 11 2023
If we do introduce something, we should use JSDoc3 and follow what's happening on this ticket T138401
Oct 3 2023
Sep 29 2023
DeliveryGuarantee.AT_LEAST_ONCE: The sink will wait for all outstanding records in the Kafka buffers to be acknowledged by the Kafka producer on a checkpoint. No messages will be lost in case of any issue with the Kafka brokers but messages may be duplicated when Flink restarts because Flink reprocesses old input records.
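A toy illustration (plain Python, not Flink code) of why at-least-once delivery can duplicate records: after a restart, the producer resumes from the last checkpointed offset, so anything delivered after that checkpoint but before the crash gets sent again.

```python
# Toy simulation of at-least-once delivery: records past the last
# checkpoint are re-sent after a restart, so the sink may see
# duplicates but never loses a record.
def deliver(records, checkpointed_upto, sink):
    """Resume sending from the last checkpointed offset."""
    for record in records[checkpointed_upto:]:
        sink.append(record)

records = ["a", "b", "c"]
sink = []
deliver(records, 0, sink)  # first run delivers all three, then crashes...
deliver(records, 2, sink)  # ...but the checkpoint only covered "a" and "b"
print(sink)                # ['a', 'b', 'c', 'c'] -- "c" is duplicated
```

Exactly-once would instead require transactional writes so that the re-sent "c" is never visible twice downstream.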
Sep 28 2023
Unaligned checkpoints didn't work. Maybe it's because data is being moved around to new brokers and Kafka is too overloaded.
@bking Gabriele is currently on sick leave but yes let's try incrementing the helm chart version
Sep 19 2023
Aug 31 2023
Associated GitHub PR: https://github.com/wikimedia/jsonschema-tools/pull/48
Aug 29 2023
Seems like in jsonschema-tools the enums are only validated through ajv, and its strict union type checking allows null, so we'll have to implement the check ourselves.
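A sketch of the kind of manual check hinted at above, assuming the goal is to flag enums that contain JSON null when the declared type doesn't allow it. The walker and the sample schema shape are illustrative, not the actual jsonschema-tools implementation.

```python
def find_null_enum_violations(schema, path=""):
    """Walk a (simplified) JSON Schema dict and flag enums containing
    JSON null where the declared type doesn't include "null"."""
    violations = []
    if isinstance(schema, dict):
        enum = schema.get("enum")
        declared = schema.get("type")
        types = declared if isinstance(declared, list) else [declared]
        if enum is not None and None in enum \
                and declared is not None and "null" not in types:
            violations.append(path or "<root>")
        for key, value in schema.items():
            violations += find_null_enum_violations(value, f"{path}/{key}")
    return violations

# A string-typed field whose enum sneaks in null (JSON null -> Python None):
bad = {"properties": {"data_type": {"type": "string",
                                    "enum": ["number", "string", None]}}}
print(find_null_enum_violations(bad))  # ['/properties/data_type']
```

A check like this could run alongside the existing ajv validation and fail the schema on any reported path.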
Aug 28 2023
While adding a workaround to T344235, I noticed that additionalProperties isn't very well represented in DataHub.
"custom_data": {
  "additionalProperties": {
    "properties": {
      "data_type": {
        "type": "string",
        "enum": ["number", "string", "boolean", "null"]
      }
    }
  },
  "propertyNames": {
    "maxLength": 255,
    "minLength": 1,
    "pattern": "^[$a-z]+[a-z0-9_]*$"
  }
}
It just shows up in DataHub as a Struct with no defined nested fields (which I guess makes sense, but is not helpful).
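For reference, the propertyNames constraints in the fragment above can be checked directly; a minimal stdlib sketch (the sample key names are made up):

```python
import re

# Constraints copied from the custom_data propertyNames fragment above.
NAME_PATTERN = re.compile(r"^[$a-z]+[a-z0-9_]*$")

def valid_property_name(name: str) -> bool:
    """True if the key satisfies the propertyNames rules: 1-255 chars,
    starting with $ or a lowercase letter, then lowercase/digits/underscore."""
    return 1 <= len(name) <= 255 and NAME_PATTERN.match(name) is not None

print(valid_property_name("page_id"))  # True
print(valid_property_name("PageId"))   # False: uppercase not allowed
print(valid_property_name("$schema"))  # True: leading $ is permitted
```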
Aug 22 2023
From the recent meeting:
- Event Streams will be the name of the platform
- Streams are upstream to Kafka topics
After experimenting a lot, I have a Datahub transformer for Kafka that generates an Event Streams platform, adds description, schema, and path. However, I don't know if it should be a transformer since it's doing a bit more than just transforming.
Aug 18 2023
Aug 16 2023
Since Datahub has the concept of platforms, I think the best way forward is to have a separate platform called Event Streams, where the datasets under it are the streams defined in the stream config. We can then keep the Kafka platform for all the individual Kafka topics. We can attach a transform to the current Kafka ingestion recipe that attaches the schemas to the individual topics where supported, and at the same time inserts the streams into the Event Streams platform. This way we can have the schemas on both the stream and its topics.
Jul 28 2023
Jul 12 2023
On the wiki for schema guidelines there's a blanket statement that all modifications should be backwards compatible. I assume this doesn't apply to major version changes, so I'll note that.
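To make the distinction concrete, here's a toy compatibility check under the usual JSON Schema evolution rules (assumed here, not quoted from the wiki): within a major version a new schema may add optional fields, but making a field required or dropping a declared field is a breaking change that warrants a major bump. The schema fragments are hypothetical.

```python
def is_backwards_compatible(old: dict, new: dict) -> bool:
    """Toy check: events valid under `old` must stay valid under `new`.

    Adding optional properties is fine; adding required fields or
    dropping previously-declared ones is a breaking (major) change.
    """
    old_props = set(old.get("properties", {}))
    new_props = set(new.get("properties", {}))
    old_required = set(old.get("required", []))
    new_required = set(new.get("required", []))
    return new_required <= old_required and old_props <= new_props

v1   = {"properties": {"id": {}, "dt": {}},             "required": ["id"]}
v1_1 = {"properties": {"id": {}, "dt": {}, "tags": {}}, "required": ["id"]}
v2   = {"properties": {"id": {}, "dt": {}},             "required": ["id", "dt"]}

print(is_backwards_compatible(v1, v1_1))  # True: added optional "tags"
print(is_backwards_compatible(v1, v2))    # False: "dt" became required
```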
Jul 10 2023
Jul 6 2023
Jun 21 2023
I could try taking a crack at it
Jun 20 2023
Jun 14 2023
Jun 13 2023
Jun 8 2023
Jun 6 2023
Jun 5 2023
Is there a benefit to doing this in Blubber, though?
The cookiecutter template also does this via a post-generation hook
https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/-/blob/main/cookiecutter-event-pipeline/hooks/post_gen_project.py
Ok so just recounting my experiments: