SHGA stands for Single Haplotype Genome Assembly. In genetics and genomics, the assembly of genomes from fragmented DNA sequences is a critical task. Traditional genome assembly involves combining DNA sequences (reads) generated by sequencing technologies into longer contiguous sequences (contigs), eventually forming a complete or near-complete genome sequence. However, this process becomes particularly challenging in organisms with complex or highly heterozygous genomes due to the presence of multiple haplotypes.
The SHGA approach focuses on assembling a single haplotype, essentially aiming to reconstruct the genome sequence of a single chromosome (or haplotype) from a heterozygous individual. This can significantly simplify the assembly process and provide valuable information for genetic studies.
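To make the read-to-contig idea above concrete, here is a minimal sketch of greedy overlap merging. It is a toy illustration only, not how any particular assembler or the approach described here is implemented: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all hypothetical, and the sketch assumes error-free reads from a single haplotype.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases and
    returns whatever contigs are left.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Join the two sequences, keeping the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical error-free reads from one haplotype.
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

With reads from two differing haplotypes mixed together, overlaps become ambiguous at heterozygous sites, which is the difficulty the paragraphs above describe.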
Initial analysis suggests this dataset is well-shuffled. There are no apparent sequential biases in the first 10,000 rows, which is excellent for training convergence. However, keep an eye on the class distribution; "sample" datasets often over-represent the minority class to balance training, which might skew real-world performance metrics.
Have you analyzed this specific SHGA release yet? What are your benchmarks looking like? Drop a comment below.
#DataScience #MachineLearning #Dataset #SecurityResearch #Python #BigData
The file shga_sample_750k.tar.gz is a sample dataset related to the massive Shanghai National Police (SHGA) database breach that surfaced in mid-2022. This breach is historically significant for its scale and the specific types of data it exposed from a government source.
Key Features of the Data
Massive Scale: While this specific file is a 750,000-record sample, the seller "ChinaDan" alleged that the full breach contained personally identifiable information (PII) on approximately 1 billion Chinese residents.
Diverse Data Types: The records in the sample (and the larger database) reportedly include names, addresses, mobile phone numbers, and national ID numbers.
Sensitive Official Records: Beyond basic contact info, an "interesting feature" noted by researchers is the inclusion of criminal record information and detailed police incident reports, including case summaries dating back several years.
Western Visibility: This incident is notable for being one of the first major Chinese government data leaks to gain significant attention in Western cybersecurity and research circles.
If a GPG signature is provided, verify the signature with gpg.
The sample was originally hosted on platforms like Breached.to (now defunct) and was distributed to verify the authenticity of the seller's claims regarding the much larger dataset.
Insights from the Shanghai National Police Database Breach
Working with shga_sample_750k.tar.gz: A Comprehensive Guide
library(data.table)  # provides fread()
bim <- fread("shga_sample.bim", header = FALSE)
colnames(bim) <- c("Chr", "SNP", "cm", "Pos", "A1", "A2")
print(paste("Markers:", nrow(bim)))
tar -xvzf shga_sample_750k.tar.gz
This will likely produce files like:
Check contents:
tar -tzf shga_sample_750k.tar.gz | head -20